Abstract

In this paper we consider the rate distortion problem of discrete-time, ergodic, and stationary sources with feed-forward at the receiver. We derive a sequence of achievable and computable rates that converge to the feed-forward rate distortion function. We show that, for ergodic and stationary sources, the rate

$$R_n(D) = \frac{1}{n} \min I(\hat{X}^n \to X^n)$$

is achievable for any $n$, where the minimization is taken over the transition conditioning probability $p(\hat{x}^n|x^n)$ such that $E[d(X^n,\hat{X}^n)] \le D$. The limit of $R_n(D)$ exists and is the feed-forward rate distortion function. We follow Gallager's proof for the case without feed-forward and, with appropriate modifications, obtain our result. We provide an algorithm for calculating $R_n(D)$ using the alternating minimization procedure, and present several numerical examples. We also present a dual form for the optimization of $R_n(D)$, and transform it into a geometric programming problem.

Index Terms 

Alternating minimization procedure, Blahut-Arimoto algorithm, causal conditioning, concatenating code trees, 
directed information, ergodic and stationary sources, geometric programming, ergodic modes, rate distortion with 
feed-forward. 



I. Introduction 



The rate distortion function for memoryless sources is well known and was given by Shannon in his seminal work [1]. Shannon [1] showed that the rate distortion function is the minimum of the mutual information between the source $X$ and the reconstruction $\hat{X}$, where the minimization is over transition probabilities $p(\hat{x}|x)$ such that the distortion constraint is satisfied, i.e., $E[d(X,\hat{X})] \le D$. In the case where the source is stationary and ergodic, Gallager [2] showed that the rate distortion function is the limit of the following sequence of rates. Each member of the sequence is the nth order rate distortion function, which is the solution of the following minimization problem

$$\frac{1}{n} \min I(X^n;\hat{X}^n).$$

The minimization is over all conditional probabilities $p(\hat{x}^n|x^n)$ such that the distortion constraint is satisfied, i.e., $E[d(X^n,\hat{X}^n)] \le D$. Gallager showed that the limit of the sequence $\frac{1}{n}\min I(X^n;\hat{X}^n)$ exists and is equal to the infimum of the sequence.

The problem of source coding with feed-forward was introduced by Weissman and Merhav [3] and by Venkataramanan and Pradhan [4], and is depicted in Fig. 1. Weissman and Merhav [3] named the problem Competitive Predictions. In their work, they defined a set of functions that predict the following $X_i$ given the previous $X^{i-1}$. After defining the loss function between $X_i$ and the prediction, the objective was to minimize the expected loss over all sets of predictors of size $M$. An important result in [3] is that in the case where the innovation process $W_i = X_i - F_i(X^{i-1})$ is i.i.d., the distortion-rate function with feed-forward is the same as the distortion-rate function of $W_i$, where there is no feed-forward. In particular, if $X_i$ is an i.i.d. process, then $W_i = X_i$ and thus the distortion-rate function with feed-forward for the source $X_i$ is the same as if there is no feed-forward.

[Figure: block diagram. The source $X^n$ enters the Encoder, which outputs $T(X^n) \in \{1,2,\ldots,2^{nR}\}$; the Decoder receives $T$ together with the source passed through a block "Delay s", and outputs $\hat{X}_i(T, X^{i-s})$, forming $\hat{X}^n$.]

Fig. 1: Source coding with feed-forward: the decoder knows the source with delay $s$, and needs to reconstruct the source within the constraint $E[d(X^n,\hat{X}^n)] \le D$.

Venkataramanan and Pradhan [4] gave an explicit definition of the rate distortion function with feed-forward for an arbitrary normalized distortion function and a general source. Their goal was to provide the rate $R$ of a source given a distortion $D$ using causal conditioning and directed information. The source of information is modeled as the process $\{X_n\}$ and is encoded in blocks of length $n$ into a message $T \in \{1, 2, \ldots, 2^{nR}\}$. The message $T$ (after $n$ time units) is sent to the decoder, which has to reconstruct the process $\{X_n\}$ using the message $T$ and causal information of the source with some delay $s$, as in Fig. 1.

For that purpose, Venkataramanan and Pradhan [4] defined the measures

$$\overline{I}(\hat{X} \to X) = \limsup_{\text{in prob}} \frac{1}{n} \log \frac{p(X^n,\hat{X}^n)}{p(\hat{X}^n\|X^{n-s})\,p(X^n)}$$

and

$$\underline{I}(\hat{X} \to X) = \liminf_{\text{in prob}} \frac{1}{n} \log \frac{p(X^n,\hat{X}^n)}{p(\hat{X}^n\|X^{n-s})\,p(X^n)}.$$

The limsup in probability of a sequence of random variables $\{X_n\}$ is defined as the smallest extended real number $\alpha$ such that $\forall \epsilon > 0$,

$$\lim_{n\to\infty} \Pr[X_n \ge \alpha + \epsilon] = 0,$$

and the liminf in probability is the largest extended real number $\beta$ such that $\forall \epsilon > 0$,

$$\lim_{n\to\infty} \Pr[X_n \le \beta - \epsilon] = 0.$$

The main result in [4] is that for a general source $\{X_n\}$ and distortion $D$, the rate distortion function with feed-forward $R(D)$ is given by

$$R(D) = \inf \overline{I}(\hat{X} \to X),$$

where the infimum is evaluated over the set $\mathcal{P}$ of probabilities $\{p(\hat{x}^n|x^n)\}_{n \ge 1}$ that satisfy the distortion constraint. Moreover, if

$$\overline{I}(\hat{X} \to X) = \underline{I}(\hat{X} \to X),$$

Venkataramanan and Pradhan showed in [4] that

$$R(D) = \inf \lim_{n\to\infty} \frac{1}{n} I(\hat{X}^n \to X^n).$$

The work of Venkataramanan and Pradhan has made a significant contribution since it gives a multi-letter characterization of the rate distortion function with feed-forward. In [5], they evaluated these formulas for a stock-market example and provided an analytical expression for the rate distortion function. However, these types of formulas are still very hard to evaluate for the general case. In this paper we show that, assuming ergodicity and stationarity of the source, the rate distortion function with feed-forward and delay $s = 1$ is upper bounded by $R_n(D)$, where

$$R_n(D) = \frac{1}{n} \min_{p(\hat{x}^n|x^n):\, E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n). \tag{1}$$

We further show that the limit of the sequence $\{R_n(D)\}$ exists, is equal to $\inf_n R_n(D)$, and is the rate distortion feed-forward function $R(D)$. These expressions for $R_n(D)$ are computable using a Blahut-Arimoto-type algorithm or using geometric programming, as demonstrated here.

In most models with causal constraints, such as feedback channels or feed-forward rate distortion, the causal conditioning probability, as well as the directed information, characterizes the fundamental limits. In order to address these models, the causal conditioning probability was introduced by Massey and Kramer [6], [7] and is defined as

$$p(x^n\|\hat{x}^{n-s}) = \prod_{i=1}^{n} p(x_i|x^{i-1},\hat{x}^{i-s}). \tag{2}$$

The difference between regular and causal conditioning is that in causal conditioning the dependence of $X_i$ on future $\hat{X}_j$ is not taken into account. Following the causal conditioning probability, Massey [6] (who was inspired by Marko's work [8] on bidirectional communication) introduced the directed information, defined as

$$I(\hat{X}^n \to X^n) \triangleq H(X^n) - H(X^n\|\hat{X}^n) = \sum_{i=1}^{n} I(\hat{X}^i; X_i|X^{i-1}).$$
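The definition above can be evaluated directly on small alphabets. The following is a minimal sketch, assuming a joint PMF stored as a Python dictionary keyed by the pair of tuples $(x^n, \hat{x}^n)$; the helper names and the two toy joints are hypothetical illustrations, not taken from the paper. It computes $I(\hat{X}^n \to X^n)$ as $\sum_i [H(X_i|X^{i-1}) - H(X_i|X^{i-1},\hat{X}^i)]$.

```python
import itertools
import math
from collections import defaultdict


def marginal(joint, keep_x, keep_xh):
    """Marginal pmf of (x^{keep_x}, xhat^{keep_xh}) from the joint pmf."""
    out = defaultdict(float)
    for (x, xh), p in joint.items():
        out[(x[:keep_x], xh[:keep_xh])] += p
    return out


def cond_entropy(joint, i, with_xhat):
    """H(X_i | X^{i-1}) or H(X_i | X^{i-1}, Xhat^i) in bits (i is 1-based)."""
    k = i if with_xhat else 0
    num = marginal(joint, i, k)      # p(x^i, xhat^k)
    den = marginal(joint, i - 1, k)  # p(x^{i-1}, xhat^k)
    h = 0.0
    for (x, xh), p in num.items():
        if p > 0:
            h -= p * math.log2(p / den[(x[:-1], xh)])
    return h


def directed_information(joint, n):
    """I(Xhat^n -> X^n) = sum_i [H(X_i|X^{i-1}) - H(X_i|X^{i-1}, Xhat^i)]."""
    return sum(cond_entropy(joint, i, False) - cond_entropy(joint, i, True)
               for i in range(1, n + 1))


# Hypothetical toy joints for an i.i.d. uniform binary source, n = 2.
pairs = list(itertools.product([0, 1], repeat=2))
copy_joint = {(x, x): 0.25 for x in pairs}             # xhat_i = x_i
delayed_joint = {(x, (0, x[0])): 0.25 for x in pairs}  # xhat_1 = 0, xhat_2 = x_1

di_copy = directed_information(copy_joint, 2)
di_delayed = directed_information(delayed_joint, 2)
```

In the second toy joint the reconstruction only repeats a symbol that the decoder already knows through feed-forward, so the directed information is zero even though the mutual information $I(X^2;\hat{X}^2)$ equals 1 bit.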

The directed information was used by Tatikonda and Mitter [9], Permuter, Weissman, and Goldsmith [10], and Kim [11] to characterize the point-to-point channel capacity with feedback. It is shown that the capacity of such channels is characterized by the maximization of the directed information over the input probability $p(x^n)$. In a previous paper [12], we used these results and obtained bounds to estimate the feedback channel capacity using a Blahut-Arimoto-type algorithm (BAA) for finding the global optimum of the directed information.

The main contribution of this work lies in extending the achievability proof given by Gallager in [2] to the case where feed-forward with delay $s = 1$ exists. The extension is done by using the causal conditioning distribution, $p(\hat{x}^n\|x^{n-s})$, rather than the regular reconstruction distribution $p(\hat{x}^n)$, in order to construct the codebook. The proof given is for $s = 1$, but can be extended straightforwardly to any delay $s > 1$. The difficulty in this modification is that while in [2] the codebook was an ensemble of sequences (code words) from the reconstruction alphabet using $p(\hat{x}^n)$, our codebook is an ensemble of code trees using $p(\hat{x}^n\|x^{n-s})$. This induces a major problem when showing that the probability of error is small, as discussed in Section III. These difficulties were overcome by appropriate modifications to Gallager's proofs.

Another contribution of this paper is the development of two optimization methods for obtaining $R_n(D)$: a BA-type algorithm and a geometric programming (GP) form. The GP form is given as a maximization problem, which can be solved using standard convex optimization methods. Further, this maximization problem gives us a lower bound to the rate distortion function with feed-forward, which helps us decide when to terminate the algorithm.

The remainder of the paper is organized as follows. In Section II we describe the problem model, provide the operational definition of the rate distortion function with feed-forward, and state our main theorems. In Section III we show that $R_n(D)$ is an achievable rate for all $n$ and any distortion $D$, and in Section IV we show that the limit of $R_n(D)$ exists and is equal to the operational rate distortion function. In Section V we present an alternative optimization problem for $R_n(D)$ in a standard geometric programming form that can be solved numerically using convex optimization tools. In Section VI we give a description of the BAA for calculating $R_n(D)$ and present the algorithm's complexity and the memory required, and in Section VII we derive the BAA and prove its convergence to the optimum value. Numerical examples are given in Section VIII to illustrate the performance of the suggested algorithms.

II. Problem Statement and Main Results 

In this section we present notation, describe the problem model, and summarize the main results of the paper. We first state the definitions of a few quantities that we use in our coding theorems. We denote by $X_i^n$ the vector $(X_i, X_{i+1}, \ldots, X_n)$. Usually we use the notation $X^n = X_1^n$ for short. Further, when writing a probability mass function (PMF) we simply write $P_X(X = x) = p(x)$. An alphabet of any type is denoted by a calligraphic letter $\mathcal{X}$, and its size is denoted by $|\mathcal{X}|$.

In the rate distortion problem with feed-forward of delay $s = 1$, as shown in Fig. 1, we consider a general discrete, stationary, and ergodic source $\{X_n\}$, with the nth order probability distribution $p(x^n)$, alphabet $\mathcal{X}$, and reconstruction alphabet $\hat{\mathcal{X}}$. The normalized bounded distortion measure is defined as $d : \mathcal{X}^n \times \hat{\mathcal{X}}^n \to \mathbb{R}^+$ on pairs of sequences.

Definition 1 (Code definition) An $(n, 2^{nR}, D)$ source code with feed-forward of block length $n$ and rate $R$ consists of an encoder mapping $f$,

$$f : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\},$$

and a sequence of decoder mappings $g_i$, $i = 1, 2, \ldots, n$,

$$g_i : \{1, 2, \ldots, 2^{nR}\} \times \mathcal{X}^{i-1} \to \hat{\mathcal{X}}, \quad i = 1, 2, \ldots, n. \tag{3}$$

The encoder maps a sequence $x^n$ to an index in $\{1, 2, \ldots, 2^{nR}\}$. At time $i$, the decoder has the message that was sent and causal information of the source, and reconstructs the ith symbol sent, $\hat{x}_i$.

Definition 2 (Achievable rate) A rate distortion with feed-forward pair $(R, D)$ is achievable if there exists a sequence of $(n, 2^{nR}, D)$-rate distortion codes with

$$\lim_{n\to\infty} E[d(X^n,\hat{X}^n)] \le D.$$

Definition 3 (Rate distortion) The rate distortion with feed-forward function $R(D)$ is the infimum of rates $R$ such that $(R, D)$ is achievable.

In this paper, we define the mathematical expression for the rate distortion function as the following limit

$$R^{(I)}(D) = \lim_{n\to\infty} R_n(D), \tag{4}$$

where $R_n(D)$ is the nth order rate distortion function given by

$$R_n(D) = \frac{1}{n} \min_{p(\hat{x}^n|x^n):\, E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n).$$

We show that the limit in (4) exists, and that $R_n(D)$ is achievable and upper bounds $R^{(I)}(D)$ for all $n$. Further, we show that the rate distortion feed-forward function, $R(D)$, is equal to $R^{(I)}(D)$. We also provide two ways to calculate the value of $R_n(D)$ numerically: a BA-type algorithm and a geometric programming form.

We now state our main theorems.

Theorem 1 (Achievability of $R_n(D)$) For a discrete, stationary, ergodic source, and for any $D$, any $n$, and delay $s = 1$, $R_n(D)$ is an achievable rate.

Theorem 2 (Rate distortion feed-forward) For any distortion $D$, the operational rate distortion function $R(D)$ is equal to the mathematical expression $R^{(I)}(D)$, where $R^{(I)}(D)$ is given in (4).

Theorem 3 The nth order rate distortion function $R_n(D)$ can be written in a geometric programming standard form as the following maximization problem

$$R_n(D) = \max \frac{1}{n}\Big(-\lambda D + \sum_{x^n} p(x^n) \log \gamma(x^n)\Big), \tag{5}$$

subject to the constraints:

$$\log p(x^n) + \log \gamma(x^n) - \lambda d(x^n,\hat{x}^n) - \sum_{i=1}^{n} \log p'(x_i|x^{i-1},\hat{x}^i) \le 0, \quad \forall\, x^n, \hat{x}^n,$$

$$\sum_{x_i} p'(x_i|x^{i-1},\hat{x}^i) = 1, \quad \forall\, i, x^{i-1}, \hat{x}^i,$$

$$\lambda \ge 0.$$

Theorem 4 (Algorithm for calculating $R_n(D)$) For a fixed source distribution $p(x^n)$, there exists an alternating minimization procedure to compute

$$R_n(D) = \frac{1}{n} \min_{p(\hat{x}^n|x^n):\, E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n). \tag{6}$$

The proofs of Theorems 1 and 2 are given in Section III and Section IV, respectively. The proof of Theorem 3 is in Section V; the algorithm in Theorem 4 is described in Section VI and proved in Section VII.



III. Achievability Proof (Theorem 1)

In this section we show that if the source is stationary and ergodic, then $R_n(D)$ as given in (1) is achievable for any $n$. In order to do so, we first assume that the source is ergodic in blocks of length $n$, and show achievability. A source that is ergodic in blocks is one for which, by looking at each $n$ letters as a single letter from a super alphabet, we obtain an ergodic super source (presented in [2, Chapter 9.8]). Then, for general ergodic sources, we follow a claim given in [2] about ergodic modes, as explained further on. The distortion is assumed to be normalized, finite, and of the form

$$d(x^n,\hat{x}^n) = \frac{1}{n}\sum_{i=1}^{n} d(x_{i-m},\hat{x}_i), \tag{7}$$

for some $m$. An example of such a distortion can be found in the stock-market example of [5] and of Section VIII.

Theorem 5 Consider a discrete stationary source that is ergodic in blocks of length $n$. For any distortion $D$ such that $R_n(D) < \infty$ and $\delta > 0$, and for any $L$ sufficiently large, there exists a codebook of trees $\mathcal{T}_C$ of length $L$ with $|\mathcal{T}_C| \le 2^{L(R_n(D)+\delta)}$ code trees for which the average distortion per letter satisfies $E[d(X^L,\hat{X}^L)] \le D + \delta$.

Proof: Let $p(\hat{x}^n|x^n)$ be the transition probability that achieves the minimum $R_n(D)$ and let $p(\hat{x}^n\|x^{n-1})$ be the causal conditioning probability that corresponds to $p(x^n)p(\hat{x}^n|x^n)$.

• Code design. For any $L$, consider the ensemble of codes $\mathcal{T}_C$ with $|\mathcal{T}_C| = \lfloor 2^{L(R_n(D)+\delta)} \rfloor$ code trees of length $L$, where each code tree $\tau^L \in \mathcal{T}_C$ is a concatenation of $L/n$ sub-code trees of length $n$. Each sub-code tree is generated independently according to $p(\hat{x}^n\|x^{n-1})$ as in Fig. 2.



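[Figure: two concatenated binary code trees, "Code tree 1" and "Code tree 2". In each tree the first symbol is drawn from $p(\hat{x}_1)$, and each later symbol is drawn conditioned on the previous reconstruction symbols and on the source branch followed so far, e.g. $p(\hat{x}_2|\hat{x}_1,x_1)$ and $p(\hat{x}_3|\hat{x}^2,x^2)$.]

Fig. 2: Concatenation of two code trees, each of length $n = 3$. The upper branches are for $x_i = 1$, and the lower branches are for $x_i = 0$.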



• Encoder. The encoder assigns a code tree $\tau^L \in \mathcal{T}_C$ for every $x^L$ such that $d(x^L,\hat{x}^L(\tau^L,x^{L-1}))$ is minimal. The sequence $\hat{x}^L(\tau^L,x^{L-1})$ is determined by walking on tree $\tau^L$, and following the branch $x^{L-1}$.

• Decoder. At time $i$, the decoder possesses the index of the tree $\tau^L$ and causal information of the source $x^{i-1}$, and returns the symbol $\hat{x}_i(\tau^L,x^{i-1})$ that it produces.
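The coding scheme above can be sketched in a few lines. This is only an illustration under stated assumptions: a code tree is represented as a dictionary from a source prefix $x^{i-1}$ to the symbol $\hat{x}_i$, the distortion is Hamming, and the two trees are hand-built rather than drawn at random from $p(\hat{x}^n\|x^{n-1})$ as in the proof. All function names are hypothetical.

```python
import itertools


def walk(tree, x):
    """Reconstruction obtained by walking on `tree` along the branch of x.

    `tree` maps every source prefix x^{i-1} (a tuple) to the symbol xhat_i.
    """
    return tuple(tree[tuple(x[:i])] for i in range(len(x)))


def hamming(x, xh):
    """Unnormalized Hamming distortion (assumed distortion measure)."""
    return sum(a != b for a, b in zip(x, xh))


def encode(codebook, x):
    """Index of the code tree with minimal distortion for the source x."""
    return min(range(len(codebook)),
               key=lambda t: hamming(x, walk(codebook[t], x)))


def decode_symbol(codebook, index, x_past):
    """Decoder at time i: uses the tree index and the causal information x^{i-1}."""
    return codebook[index][tuple(x_past)]


L = 3
prefixes = [p for i in range(L) for p in itertools.product([0, 1], repeat=i)]
zero_tree = {p: 0 for p in prefixes}                     # always outputs 0
delay_tree = {p: (p[-1] if p else 0) for p in prefixes}  # repeats x_{i-1}
codebook = [zero_tree, delay_tree]

x = (1, 1, 1)
index = encode(codebook, x)
x_hat = tuple(decode_symbol(codebook, index, x[:i]) for i in range(L))
```

The decoder never sees $x_i$ before producing $\hat{x}_i$; it only follows the branch $x^{i-1}$, which is exactly the information made available by feed-forward with delay 1.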
Let us define a test channel as the conditional probability

$$p_L(\hat{x}^L|x^L) = \prod_{i=0}^{L/n-1} p(\hat{x}_{ni+1}^{ni+n}|x_{ni+1}^{ni+n}), \tag{8}$$

and the causal conditional probability

$$p_L(\hat{x}^L\|x^{L-1}) = \prod_{i=0}^{L/n-1} p(\hat{x}_{ni+1}^{ni+n}\|x_{ni+1}^{ni+n-1}), \tag{9}$$

where the distribution is according to

$$P_{\hat{X}_{ni+1}^{ni+n}|X_{ni+1}^{ni+n}}(\hat{x}^n|x^n) = P_{\hat{X}^n|X^n}(\hat{x}^n|x^n).$$

Moreover, we define for every code tree $\tau^L$ of length $L$ the measure

$$I_n(\tau^L \to x^L) = \log \frac{p_L(\hat{x}^L|x^L)}{p_L(\hat{x}^L\|x^{L-1})},$$

where $\hat{x}^L = \hat{x}^L(\tau^L,x^{L-1})$. Note that $I_n(\tau^L \to x^L)$ is not the directed information between the sequences $\hat{x}^L, x^L$, but simply a measure between a source sequence $x^L$ and the output $\hat{x}^L$ of the test channel $p_L(\hat{x}^L|x^L)$, as defined in (8).

Let $\mathcal{T}$ be the set of all code trees of length $L$, and consider the following set,

$$A = \{\tau^L \in \mathcal{T}, x^L \in \mathcal{X}^L : \text{either } I_n(\tau^L \to x^L) > L(R_n(D)+\delta/2) \text{ or } d(x^L,\hat{x}^L(\tau^L,x^{L-1})) > L(D+\delta/2)\}, \tag{10}$$

and let $p_t(A)$ be the probability of the set $A$ on the test channel ensemble.

Let us use the notation

$$\hat{x}^L(\mathcal{T}_C,x^{L-1}) = \hat{x}^L\Big(\arg\min_{\tau^L \in \mathcal{T}_C} d(x^L,\hat{x}^L(\tau^L,x^{L-1})),\; x^{L-1}\Big),$$

where $\mathcal{T}_C$ is the ensemble of code trees as described in the coding scheme. Now, let $p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD)$ be the probability, over the ensemble of codes $\mathcal{T}_C$ and source sequences, that the distortion exceeds $LD$. We wish to give an upper bound to the probability $p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD)$; for this we use the following lemma.

Lemma 1 For a given source $\{X_i\}_{i \ge 1}$ and test channel, we have the following inequality

$$p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD) \le p_t(A) + \exp\{-|\mathcal{T}_C|\,2^{-LR_n(D)}\}, \tag{11}$$

where the set $A$ is described in (10).
Proof. We first write $p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD)$ as

$$p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD) = \sum_{x^L \in \mathcal{X}^L} p(x^L)\, p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD \mid X^L = x^L).$$

For every $x^L$, let us define the set $A_{x^L}$ as the set of all code trees $\tau^L \in \mathcal{T}$ for which $(\tau^L,x^L) \in A$,

$$A_{x^L} = \{\tau^L \in \mathcal{T} : \text{either } I_n(\tau^L \to x^L) > L(R_n(D)+\delta/2) \text{ or } d(x^L,\hat{x}^L(\tau^L,x^{L-1})) > L(D+\delta/2)\}. \tag{12}$$

We observe that $d(x^L,\hat{x}^L(\mathcal{T}_C,x^{L-1})) > LD$ for a given $x^L$ only if $d(x^L,\hat{x}^L(\tau^L,x^{L-1})) > LD$ for every $\tau^L \in \mathcal{T}_C$. Thus, $d(x^L,\hat{x}^L(\mathcal{T}_C,x^{L-1})) > LD$ only if $\tau^L \in A_{x^L}$ for every $\tau^L \in \mathcal{T}_C$. Since the trees $\tau^L$ are independently chosen,

$$p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD \mid X^L = x^L) \le (p_t(A_{x^L}))^{|\mathcal{T}_C|} = (1 - p_t(A^c_{x^L}))^{|\mathcal{T}_C|},$$

where $A^c_{x^L}$ is the complement of $A_{x^L}$. We note that the probability of a tree $\tau^L$ being in $A^c_{x^L}$ depends only on the branch associated with $x^{L-1}$. In other words, if a tree $\tau^L \in A^c_{x^L}$, then all other trees with the same branch associated with $x^{L-1}$ are in $A^c_{x^L}$ as well; the same goes for $A_{x^L}$. Hence, we can divide the set of all code trees $\mathcal{T}$ into disjoint subsets $B_{x^{L-1},\hat{x}^L}$ that have the same branch associated with $x^{L-1}$, i.e.,

$$B_{x^{L-1},\hat{x}^L} = \{\tau^L \in \mathcal{T} : \tau^L(x^{L-1}) = \hat{x}^L\},$$

where $\tau^L(x^{L-1})$ is a walk on tree $\tau^L$ over the branch $x^{L-1}$. Clearly, the probability of each subset $B_{x^{L-1},\hat{x}^L}$ is

$$p_t(B_{x^{L-1},\hat{x}^L}) = p_L(\hat{x}^L\|x^{L-1}),$$

since the left-hand side is a summation of the probabilities of all trees with the same branch associated with $x^{L-1}$, and we are left with the probability of that one branch.



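Now, for every $\tau^L \in B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}$, and due to the definition of $A_{x^L}$, we have

$$I_n(\tau^L \to x^L) = \log \frac{p_L(\hat{x}^L|x^L)}{p_L(\hat{x}^L\|x^{L-1})} \le LR_n(D).$$

Therefore,

$$p_L(\hat{x}^L\|x^{L-1}) \ge p_L(\hat{x}^L|x^L)\,2^{-LR_n(D)}, \tag{13}$$

and we obtain that

$$\begin{aligned}
p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD \mid X^L = x^L) &\le (1 - p_t(A^c_{x^L}))^{|\mathcal{T}_C|} \\
&= \Big(1 - \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p_t(B_{x^{L-1},\hat{x}^L})\Big)^{|\mathcal{T}_C|} \\
&= \Big(1 - \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p_L(\hat{x}^L\|x^{L-1})\Big)^{|\mathcal{T}_C|} \\
&\stackrel{(a)}{\le} \Big(1 - 2^{-LR_n(D)} \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p_L(\hat{x}^L|x^L)\Big)^{|\mathcal{T}_C|},
\end{aligned}$$

where (a) follows from the inequality in (13).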



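Using the inequality $(1-ab)^k \le 1 - a + \exp\{-bk\}$, and taking $a = \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p_L(\hat{x}^L|x^L)$, $b = 2^{-LR_n(D)}$, and $k = |\mathcal{T}_C|$, we find

$$p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD \mid X^L = x^L) \le 1 - \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p_L(\hat{x}^L|x^L) + \exp\{-|\mathcal{T}_C|\,2^{-LR_n(D)}\}.$$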

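By taking a sum over $x^L$ we remain with

$$\begin{aligned}
p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD) &= \sum_{x^L} p(x^L)\, p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD \mid X^L = x^L) \\
&\le \sum_{x^L} p(x^L)\Big[1 - \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p_L(\hat{x}^L|x^L) + \exp\{-|\mathcal{T}_C|\,2^{-LR_n(D)}\}\Big] \\
&= 1 - \sum_{x^L} \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p(x^L,\hat{x}^L) + \exp\{-|\mathcal{T}_C|\,2^{-LR_n(D)}\}.
\end{aligned} \tag{14}$$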



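Note that

$$\begin{aligned}
\sum_{x^L} \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} p(x^L,\hat{x}^L) &= \sum_{x^L} \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} \sum_{\tau^L} p(x^L,\hat{x}^L,\tau^L) \\
&\ge \sum_{x^L} \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} \sum_{\tau^L \in B_{x^{L-1},\hat{x}^L}} p(x^L,\hat{x}^L,\tau^L) \\
&\stackrel{(a)}{=} \sum_{x^L} \sum_{\hat{x}^L:\, B_{x^{L-1},\hat{x}^L} \subset A^c_{x^L}} \sum_{\tau^L \in B_{x^{L-1},\hat{x}^L}} p(x^L,\tau^L) \\
&= \sum_{x^L} \sum_{\tau^L \in A^c_{x^L}} p(x^L,\tau^L),
\end{aligned}$$

where (a) follows from the fact that if $\tau^L \in B_{x^{L-1},\hat{x}^L}$ then $\hat{x}^L$ is determined by the tree $\tau^L$ and the branch $x^{L-1}$. The last expression is $p_t(A^c)$. Now, continuing from equation (14), we obtain

$$\begin{aligned}
p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > LD) &\le 1 - p_t(A^c) + \exp\{-|\mathcal{T}_C|\,2^{-LR_n(D)}\} \\
&= p_t(A) + \exp\{-|\mathcal{T}_C|\,2^{-LR_n(D)}\}.
\end{aligned} \tag{15}$$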



We now use the result in (15) in order to complete the proof of the theorem. We can see that the average distortion of the code satisfies

$$E[d(X^L,\hat{X}^L)] \le (D+\delta/2) + p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > L(D+\delta/2)) \cdot \sup_{x^L,\hat{x}^L} d(x^L,\hat{x}^L).$$

This arises, as in [2, Th. 9.3.1], from upper bounding the distortion by $D+\delta/2$ when $d(x^L,\hat{x}^L) \le D+\delta/2$, and by

$$\sup_{x^L,\hat{x}^L} d(x^L,\hat{x}^L)$$

otherwise. By choosing $|\mathcal{T}_C| = \lfloor 2^{L(R_n(D)+\delta)} \rfloor$, the last term in (15) goes to zero with increasing $L$. Furthermore, the first term is bounded by

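$$\begin{aligned}
p_t(A) \le{}& p_t\{x^L \in \mathcal{X}^L, \tau^L \in \mathcal{T} : I_n(\tau^L \to x^L) > L(R_n(D)+\delta/2)\} \\
&+ p_t\{x^L \in \mathcal{X}^L, \tau^L \in \mathcal{T} : d(x^L,\hat{x}^L(\tau^L,x^{L-1})) > L(D+\delta/2)\}.
\end{aligned} \tag{16}$$

Note that

$$p_t\big(I_n(\tau^L \to x^L) > L(R_n(D)+\delta/2)\big) = p_t\Big(\frac{1}{L}\sum_{i=0}^{L/n-1} \log \frac{p(\hat{x}_{ni+1}^{ni+n}|x_{ni+1}^{ni+n})}{p(\hat{x}_{ni+1}^{ni+n}\|x_{ni+1}^{ni+n-1})} > R_n(D) + \delta/2\Big).$$

As assumed, the source is ergodic in blocks of length $n$. Furthermore, the test channel is defined to be memoryless for blocks of length $n$, and hence the joint process is ergodic in blocks of length $n$. Thus, with probability 1,

$$\lim_{L\to\infty} \frac{1}{L}\sum_{i=0}^{L/n-1} \log \frac{p(\hat{x}_{ni+1}^{ni+n}|x_{ni+1}^{ni+n})}{p(\hat{x}_{ni+1}^{ni+n}\|x_{ni+1}^{ni+n-1})} = \frac{1}{n} E\Big[\log \frac{p(\hat{X}^n|X^n)}{p(\hat{X}^n\|X^{n-1})}\Big] = R_n(D).$$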

Therefore, the first term in (16) goes to zero as $L$ goes to infinity, and the same holds for the second term due to the definition of the distortion. To finish the proof, since $p_c$ goes to zero with increasing $L$ and the distortion is finite, we can choose $L$ large enough such that

$$p_c(d(X^L,\hat{x}^L(\mathcal{T}_C,X^{L-1})) > L(D+\delta/2)) \cdot \sup_{x^L,\hat{x}^L} d(x^L,\hat{x}^L) \le \delta/2.$$

In this case the average distortion per letter is at most $D + \delta$, and hence the rate $R_n(D)$ is achievable for sources that are ergodic in blocks of length $n$. ■

Much like in Gallager's proof for the case where there is no feed-forward, we note that not all ergodic sources are also ergodic in blocks, and we need to address these cases as well. For that purpose, we need [2, Lemma 9.8.2] for ergodic sources. We recall that a discrete stationary source is ergodic if and only if every invariant set of sequences under a shift operator $T$ is of probability 1 or 0. In [2, Chapter 9.8], the author looks at the operator $T^n$, i.e., a shift of $n$ places, and considers an invariant set $S_0$, $p(S_0) > 0$, with respect to $T^n$. In Lemma 9.8.2 in [2], it is stated that one can separate the source into $n'$ invariant subsets $\{S_i = T^i(S_0)\}_{i=0}^{n'-1}$, $p(S_i) = \frac{1}{n'}$, with regard to $T^n$, such that $n'$ divides $n$ and the sets $S_i, S_j$ are disjoint except, perhaps, for an intersection of zero probability. These subsets are called ergodic modes, due to the fact that each invariant subset of them under the operator $T^n$ is of probability 0 or $\frac{1}{n'}$. In other words, conditional on an ergodic mode $S_i$, each invariant subset of it with respect to $T^n$ is of probability 0 or 1.



Recall that, by definition,

$$R_n(D) = \frac{1}{n} I_n(\hat{X}^n \to X^n),$$

where the right-hand side is the average directed information between the source and reconstruction, determined according to the $p(\hat{x}^n|x^n)$ that achieves $R_n(D)$. Let $I_n(\hat{X}^n \to X^n|i)$ be the average directed information between a source sequence from the ith ergodic mode and the ensemble of codes, using the probability $p(\hat{x}^n|x^n)$ which achieves $R_n(D)$. Note that the directed information can be written as

$$I_n(\hat{X}^n \to X^n) = \sum_{x^n,\hat{x}^n} p(x^n)p(\hat{x}^n|x^n) \log \frac{p(\hat{x}^n|x^n)}{p(\hat{x}^n\|x^{n-1})} = D\big(p(x^n)p(\hat{x}^n|x^n)\,\big\|\,p(\hat{x}^n\|x^{n-1})p(x^n)\big),$$

which is concave in the input probability $p(x^n)$. Since $p(x^n)$ is the equal-weight mixture of the distributions conditional on the $n'$ ergodic modes, it follows that

$$I_n(\hat{X}^n \to X^n) \ge \frac{1}{n'}\sum_{i=0}^{n'-1} I_n(\hat{X}^n \to X^n|i). \tag{17}$$



We observe that $\frac{1}{n} I_n(\hat{X}^n \to X^n|i)$ is an upper bound to the nth order rate distortion function conditional on the ith ergodic mode. From Theorem 5 we know that there exists a codebook $\mathcal{T}_{C_i}$ with $|\mathcal{T}_{C_i}| = \lfloor 2^{L(\frac{1}{n} I_n(\hat{X}^n \to X^n|i)+\delta)} \rfloor$ code trees of length $L$ such that the average distortion constraint holds. Another observation is that if a codebook $\mathcal{T}_{C_i}$ satisfies the distortion constraint conditional on the ergodic mode $S_i$, then it has the same effect conditional on the ergodic mode $T(S_{i-1})$. In other words, we can encode not only a source sequence from $S_{i-1}$ with $\mathcal{T}_{C_{i-1}}$, but also a shift of a source sequence in $S_{i-1}$ with $\mathcal{T}_{C_i}$. We use these observations while constructing the codebook.

We can now prove Theorem 1, i.e., the achievability of $R_n(D)$ where the source is ergodic and stationary. An equivalent version of Theorem 1 is the following: let $R_n(D)$ be the nth order rate distortion function for a discrete, stationary, and ergodic source. For any $D$ such that $R_n(D) < \infty$, any $\delta > 0$, and any $L$ sufficiently large, there exists a codebook of trees $\mathcal{T}_C$ of length $L$ with $|\mathcal{T}_C| \le 2^{L(R_n(D)+\delta)}$ code trees for which the average distortion per letter satisfies $E[d(X^L,\hat{X}^L)] \le D + \delta$.

Proof of Theorem 1: Let $p(\hat{x}^n|x^n)$ be the transition probability that achieves $R_n(D)$ and let $p(\hat{x}^n\|x^{n-1})$ be the causal conditioning probability that corresponds to $p(x^n)p(\hat{x}^n|x^n)$.

• Code design. For any $L$ and any ergodic mode $S_i$, $0 \le i < n'$, construct an ensemble of codes $\mathcal{T}_{C_i}$, with $|\mathcal{T}_{C_i}| = \lfloor 2^{L(\frac{1}{n} I_n(\hat{X}^n \to X^n|i)+\delta)} \rfloor$ 'little' code trees of length $L$, where each 'little' code tree is generated according to $p(\hat{x}^L\|x^{L-1})$, as in Fig. 2 in Theorem 5 above. Now, for every $0 \le i < n'$, the ith codebook is an ensemble of 'big' code trees, which are concatenations of $n'$ 'little' code trees, starting from one in $\mathcal{T}_{C_i}$, and followed by one from $\mathcal{T}_{C_{i+1}}$ up to one from $\mathcal{T}_{C_{n'+i-1}}$, where the index is calculated modulo $n'$. In the example of a 'big' code tree in Fig. 3 we see additional letters at the end of each 'little' code tree, i.e., in positions $L+1, 2(L+1), \ldots, n'(L+1)$, that are fixed. The purpose of the fixed letters is to shift the sequence and encode it with a code tree from the next codebook. Note that the overall length of a code tree sums up to $L' = Ln' + n'$.

[Figure: a 'big' code tree formed by a code tree from $\mathcal{T}_{C_i}$, a code tree from $\mathcal{T}_{C_{i+1}}$, and a code tree from $\mathcal{T}_{C_{i+2}}$, each followed by a fixed letter.]

Fig. 3: A code tree from the ith codebook, $n = n' = 3$, $L = 6$.



• Encoder. For every $i$, the encoder assigns to every source sequence $x^{L'} \in S_i$ a code tree $\tau^{L'}$ from the ith codebook, such that $d(x^{L'},\hat{x}^{L'}(\tau^{L'},x^{L'-1}))$ is minimal. The sequence $\hat{x}^{L'}(\tau^{L'},x^{L'-1})$ is determined by walking on tree $\tau^{L'}$, and following the branch $x^{L'-1}$.

• Decoder. The decoder receives a tree $\tau^{L'}$ and causal information of $x^{L'}$ and returns the sequence $\hat{x}^{L'}$ that it produces.

Since the distortion constraint for every ergodic mode is satisfied due to Theorem 5, the overall distortion constraint is satisfied as well. The additional fixed letters are of unknown distortion, but due to the fact that the distortion is bounded, their contribution is negligible for large $L$. Moreover, note that for every $i$, the ith codebook is of the same size. Thus, the overall size of the codebook is

$$|\mathcal{T}_C| = n' \prod_{i=0}^{n'-1} |\mathcal{T}_{C_i}| \le n' \prod_{i=0}^{n'-1} 2^{L(\frac{1}{n} I_n(\hat{X}^n \to X^n|i)+\delta)}.$$

By (17), the right-hand side is at most $n'\,2^{Ln'(R_n(D)+\delta)}$. Recall that $L' = Ln' + n'$, and by letting $\delta' = \delta + \frac{\log n'}{L'}$ we conclude that $R_n(D)$ is an achievable rate for the general ergodic source, as required. ■

IV. Proof that $R(D) = R^{(I)}(D)$ (Theorem 2)

In this section we show that the operational description of the rate distortion function with feed-forward is equal to the mathematical one given in (18). This will be done first by showing that the mathematical expression $R^{(I)}(D)$ is achievable, and then by showing that it is a lower bound to the rate distortion function. We recall that

$$R^{(I)}(D) = \lim_{n\to\infty} \frac{1}{n} \min_{p(\hat{x}^n|x^n):\, E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n). \tag{18}$$

To show that $R^{(I)}(D)$ is achievable we first need to show that the limit of the sequence $\{R_n(D)\}$ exists. For this purpose, we use the following lemma.

Lemma 2 The sequence $R_n(D)$,

$$R_n(D) = \frac{1}{n} \min_{p(\hat{x}^n|x^n):\, E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n),$$

is a sub-additive sequence, and thus

$$\inf_n R_n(D) = \lim_{n\to\infty} R_n(D).$$

Note that a sequence $\{a_n\}$ is called sub-additive if for all $m, l$,

$$(m + l)\,a_{m+l} \le m\,a_m + l\,a_l.$$

The proof of Lemma 2 is given in App. A.
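The sub-additivity condition and its consequence (the limit equals the infimum) can be illustrated numerically. The sketch below uses a hypothetical sequence $a_n = 0.5 + 1/n$, chosen only because it satisfies the condition; it is not $R_n(D)$ for any source in the paper.

```python
def is_subadditive(a, n_max):
    """Check (m + l) a(m + l) <= m a(m) + l a(l) for all m + l <= n_max."""
    return all((m + l) * a(m + l) <= m * a(m) + l * a(l) + 1e-12
               for m in range(1, n_max)
               for l in range(1, n_max - m + 1))


def a(n):
    # Hypothetical sequence, not R_n(D): sub-additive with infimum 0.5.
    return 0.5 + 1.0 / n


values = [a(n) for n in range(1, 2001)]
smallest = min(values)  # approaches the infimum 0.5 as n grows
```

For this sequence $(m+l)\,a_{m+l} = 0.5(m+l) + 1$, while $m\,a_m + l\,a_l = 0.5(m+l) + 2$, so the condition holds for every pair $m, l$.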

We now state a lemma for the achievability of $R^{(I)}(D)$.

Lemma 3 (Achievability of $R^{(I)}(D)$) The mathematical expression for the rate distortion feed-forward function $R^{(I)}(D)$ is achievable, and thus upper bounds $R(D)$.

Proof: We showed in Theorem 1 that for any $n$, $R_n(D)$ is achievable. Further, in Lemma 2 we show that the limit exists and is equal to the infimum, and hence is achievable too. Therefore, we conclude that the mathematical expression $R^{(I)}(D)$ is achievable, and forms an upper bound to the operational description $R(D)$. ■

To show that $R^{(I)}(D)$ is a lower bound to the rate distortion function, we provide the following lemma.

Lemma 4 (Converse) The mathematical expression $R^{(I)}(D)$ is a lower bound to the operational rate distortion function.

For completeness, we provide the proof of Lemma 4 in App. B. A similar proof was presented by Venkataramanan and Pradhan in [4], where the expressions involve limits in probability of the entropy and directed information, as described in Section I.

Proof of Theorem 2: Combining Lemmas 3 and 4 provides us with the proof of our fundamental theorem, stated in Section II, i.e., the operational rate distortion function $R(D)$ is equal to the mathematical one, $R^{(I)}(D)$. ■

V. Geometric Programming Form of $R_n(D)$ (Theorem 3)

In this section we show that the nth order rate distortion function with feed-forward $R_n(D)$ can be given as a maximization problem, written in a standard form of geometric programming. For this purpose we first state the following theorem.

Theorem 6 The nth order rate distortion function, $R_n(D)$, can be written as the following maximization problem

$$R_n(D) = \max_{\lambda \ge 0,\, \gamma(x^n)} \frac{1}{n}\Big(-\lambda D + \sum_{x^n} p(x^n) \log \gamma(x^n)\Big), \tag{19}$$

where, for some causal conditioned probability $p'(x^n\|\hat{x}^n)$, $\gamma(x^n)$ satisfies the inequality constraint

$$p(x^n)\gamma(x^n)2^{-\lambda d(x^n,\hat{x}^n)} \le p'(x^n\|\hat{x}^n). \tag{20}$$

In App. C we provide two proofs of Theorem 6. The first is similar to Berger's proof in [13] for the regular rate distortion function and is based on the inequality $\log(y) \ge 1 - \frac{1}{y}$; the second uses the Lagrange duality, as presented in [14] and [15], which transforms a minimization problem into a maximization one. App. C also includes the connection between the rate distortion function and the parameter $\lambda$, which states that the slope of $R_n(D)$ at the point $D$ is $-\frac{\lambda}{n}$.

Proof of Theorem 3: Considering the theorem above, our interest now is to adjust the constraints in order to obtain a geometric programming form. We note that the optimization problem in (19) does not change if we maximize over $p'(x^n\|\hat{x}^n)$ as well, so that the constraint (20) is no longer required only for some $p'$, i.e.,

$$R_n(D) = \max_{\lambda \ge 0,\, \gamma(x^n),\, p'(x^n\|\hat{x}^n)} \frac{1}{n}\Big(-\lambda D + \sum_{x^n} p(x^n) \log \gamma(x^n)\Big), \tag{21}$$

where $\gamma(x^n)$, $p'(x^n\|\hat{x}^n)$ satisfy the inequality constraint

$$p(x^n)\gamma(x^n)2^{-\lambda d(x^n,\hat{x}^n)} \le p'(x^n\|\hat{x}^n). \tag{22}$$

The above statement is true since, on the one hand, the maximization in (19) can only increase upon maximizing over another variable, $p'(x^n\|\hat{x}^n)$, as in (21); on the other hand, the variables $\gamma^*(x^n)$, $p'^*(x^n\|\hat{x}^n)$ that achieve (21) satisfy the constraint (20) in Theorem 6, and hence the maximization problem in (21) cannot be greater than the one in (19).

To obtain a geometric programming standard form we transform the constraint in d22l , such that 

p{x n )j(x n )2~ xd ^ n ' x ^p'(x n \\x n )~ 1 < 1. 
Taking the log of both sides, we obtain 

n 

\og(p(x n )) + log( 7 (x™)) - Xd(x n ,x n ) - \ogp'{x n \\x n ) < 0. 

Note that maximizing over $p'(x^n\|\hat{x}^n)$ is the same as maximizing over its factors $\{p'(x_i|x^{i-1},\hat{x}^i)\}_{i=1}^n$ [10, Lemma 3]. Therefore, we can conclude that the rate distortion with feed-forward $R_n(D)$ can be given in the following geometric programming maximization form,

$$R_n(D) = \max_{\lambda,\ \gamma(x^n),\ \{p'(x_i|x^{i-1},\hat{x}^i)\}_{i=1}^n} \frac{1}{n}\left(-\lambda D + \sum_{x^n} p(x^n)\log\gamma(x^n)\right),$$

subject to

$$\log p(x^n) + \log\gamma(x^n) - \lambda d(x^n,\hat{x}^n) - \sum_{i=1}^{n}\log p'(x_i|x^{i-1},\hat{x}^i) \le 0, \quad \forall\, x^n,\hat{x}^n,$$

$$\sum_{x_i} p'(x_i|x^{i-1},\hat{x}^i) = 1, \quad \forall\, i,\ x^{i-1},\ \hat{x}^i,$$

$$\lambda \ge 0.$$

Hence, we obtain a standard form of geometric programming. This GP problem can be solved using standard convex optimization tools. ■
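As a sanity check of the dual form, the following minimal Python sketch evaluates the objective in (19) in the simplest case that can be verified by hand: block length $n = 1$ (where feed-forward plays no role), a Bernoulli(1/2) source and Hamming distortion. The candidate point ($\lambda = \log_2\frac{1-D}{D}$, a uniform reconstruction PMF, and $\gamma(x)$ equal to the inverse of the normalizing sum) is our illustrative choice under these assumptions, and the function names are hypothetical.

```python
import math

def hb(t):
    """Binary entropy in bits."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def dual_point(D):
    """Evaluate the dual objective of (19) for n = 1, X ~ Bernoulli(1/2), Hamming distortion.

    Assumed candidate point (not taken from the paper's text):
    lam = log2((1-D)/D), q(xh) uniform, gamma(x) = 1 / sum_xh q(xh) 2^(-lam d(x, xh)).
    """
    lam = math.log2((1 - D) / D)
    src = {0: 0.5, 1: 0.5}
    q = {0: 0.5, 1: 0.5}
    d = lambda x, xh: 0.0 if x == xh else 1.0
    gamma = {x: 1.0 / sum(q[xh] * 2.0 ** (-lam * d(x, xh)) for xh in (0, 1))
             for x in (0, 1)}
    objective = -lam * D + sum(src[x] * math.log2(gamma[x]) for x in (0, 1))
    # Left-hand side of constraint (20) for every pair (x, xh).
    lhs = {(x, xh): src[x] * gamma[x] * 2.0 ** (-lam * d(x, xh))
           for x in (0, 1) for xh in (0, 1)}
    return lam, objective, lhs

lam, objective, lhs = dual_point(0.2)
print(lam, objective, 1.0 - hb(0.2))
```

For this point the left-hand side of (20), summed over $x$ for each fixed $\hat{x}$, equals one, so it is itself a valid $p'(x|\hat{x})$ and the constraint holds with equality; the objective then coincides with $H_b(1/2) - H_b(D)$.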

VI. Extension of the BAA for rate distortion with feed-forward

In this section we describe an algorithm for calculating $R_n(D)$, where

$$R_n(D) = \frac{1}{n}\min_{r(\hat{x}^n|x^n):\ E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n), \tag{23}$$

using the alternating minimization procedure. This method was first used by Blahut and Arimoto [16], [17] to obtain a numerical solution for the i.i.d. source rate distortion and for the memoryless channel capacity. Recently, in [12] we extended this method for finding the global maximum of the following optimization problem:

$$C_n = \frac{1}{n}\max I(X^n \to Y^n),$$

and we apply similar methods here.

Before we describe the algorithm, let us denote by $r = r(\hat{x}^n|x^n)$, $q = q(\hat{x}^n\|x^{n-1})$ the PMFs that participate in the minimization. Further, let us consider the double optimization problem given by

$$R_n(D) = \frac{1}{n}\left(-\lambda D + \min_{r}\min_{q} K(r,q)\right), \tag{24}$$

where

$$K(r,q) = I_{FF}(r,q) + \lambda E_r\left[d(X^n,\hat{X}^n)\right],$$

and $I_{FF}(r,q)$ is the directed information, which can be written as

$$I_{FF}(r,q) = I(\hat{X}^n \to X^n) = \sum_{x^n,\hat{x}^n} p(x^n)\,r(\hat{x}^n|x^n)\log\frac{r(\hat{x}^n|x^n)}{q(\hat{x}^n\|x^{n-1})}. \tag{25}$$

In Section VII we show that the double optimization problem given in (24) is equal to the one given in (23). Equations (24), (25) allow us to apply the alternating minimization procedure.

A. Description of the algorithm 

In Algorithm 1 we present the steps required to minimize the directed information where the input PMF $p(x^n)$ is fixed. The parameter $\lambda$ is used in the Lagrangian with which we optimize the directed information. The value of $D_k$, and hence $R_n(D_k)$, depends on $\lambda$; thus choosing $\lambda$ appropriately sweeps out the $R_n(D)$ curve. The algorithm stops when $F < \epsilon$. In App. D we provide upper and lower bounds, used to show that if $F < \epsilon$, we ensure that

$$\left|R_n^k(D_k) - R_n(D_k)\right| < \epsilon.$$

Algorithm 1 Iterative algorithm for calculating $R_n(D)$, where $p(x^n)$ is fixed.

(a) Fix a value of $\lambda \ge 0$ that determines a point on the $R_n(D)$ curve.

(b) Start from a random causally conditioned point $q^0(\hat{x}^n\|x^{n-1})$. Usually we start from a uniform distribution, i.e., $q^0(\hat{x}^n\|x^{n-1}) = 2^{-n}$ for every $(x^n,\hat{x}^n)$.

(c) Set $k = 1$.

(d) Compute $r^k(\hat{x}^n|x^n)$ using the formula

$$r^k(\hat{x}^n|x^n) = \frac{q^{k-1}(\hat{x}^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}^n)}}{\sum_{\hat{x}'^n} q^{k-1}(\hat{x}'^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}'^n)}}.$$

(e) Calculate the joint probability $p(x^n,\hat{x}^n) = p(x^n)\,r^k(\hat{x}^n|x^n)$, and deduce the causal conditioned PMF $q^k(\hat{x}^n\|x^{n-1}) = \prod_{i=1}^{n} p(\hat{x}_i|\hat{x}^{i-1},x^{i-1})$.

(f) Calculate the parameter

$$c^k_{\hat{x}^n,x^{n-1}} = \frac{q^k(\hat{x}^n\|x^{n-1})}{q^{k-1}(\hat{x}^n\|x^{n-1})}.$$

(g) Calculate

$$F = \log\max_{\hat{x}^n,x^{n-1}} c^k_{\hat{x}^n,x^{n-1}} - \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log c^k_{\hat{x}^n,x^{n-1}}.$$

(h) If $F \ge \epsilon$, set $k := k + 1$, and return to (d).

(i) The rate distortion function, with distortion $D_k = \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\,d(x^n,\hat{x}^n)$, is

$$R_n(D_k) = \frac{1}{n}\sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log\frac{r^k(\hat{x}^n|x^n)}{q^k(\hat{x}^n\|x^{n-1})}.$$
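The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal, unoptimized illustration under stated assumptions rather than the authors' implementation: binary source and reconstruction alphabets, a source PMF that is strictly positive on every $x^n$, base-2 logarithms, and hypothetical names (`causal_q`, `baa_feed_forward`). The causally conditioned PMF in step (e) is computed as $\prod_i p(\hat{x}_i|\hat{x}^{i-1},x^{i-1})$ from the joint distribution.

```python
import itertools
import math

def causal_q(joint, seqs, n):
    """q(xh^n || x^{n-1}) = prod_i p(xh_i | xh^{i-1}, x^{i-1}), computed from a joint PMF."""
    num = [{} for _ in range(n)]   # marginals over (xh^{i+1}, x^{i}), 0-based i
    den = [{} for _ in range(n)]   # marginals over (xh^{i},   x^{i})
    for (xh, x), pr in joint.items():
        for i in range(n):
            kn, kd = (xh[:i + 1], x[:i]), (xh[:i], x[:i])
            num[i][kn] = num[i].get(kn, 0.0) + pr
            den[i][kd] = den[i].get(kd, 0.0) + pr
    q = {}
    for xh in seqs:
        for x in seqs:
            val = 1.0
            for i in range(n):
                val *= num[i][(xh[:i + 1], x[:i])] / den[i][(xh[:i], x[:i])]
            q[(xh, x)] = val
    return q

def baa_feed_forward(p_src, dist, n, lam, eps=1e-9, max_iter=10000):
    """Return (R_n(D_k), D_k) for a fixed Lagrange parameter lam (delay 1, binary alphabets)."""
    seqs = list(itertools.product((0, 1), repeat=n))
    q = {(xh, x): 2.0 ** (-n) for xh in seqs for x in seqs}      # step (b)
    for _ in range(max_iter):
        r = {}
        for x in seqs:                                           # step (d)
            w = {xh: q[(xh, x)] * 2.0 ** (-lam * dist(x, xh)) for xh in seqs}
            z = sum(w.values())
            for xh in seqs:
                r[(xh, x)] = w[xh] / z
        joint = {k: p_src[k[1]] * r[k] for k in r}               # step (e)
        q_new = causal_q(joint, seqs, n)
        c = {k: q_new[k] / q[k] for k in q}                      # step (f)
        F = math.log2(max(c.values())) - sum(
            joint[k] * math.log2(c[k]) for k in joint)           # step (g)
        q = q_new
        if F < eps:                                              # step (h)
            break
    D = sum(joint[(xh, x)] * dist(x, xh) for (xh, x) in joint)   # step (i)
    R = sum(joint[k] * math.log2(r[k] / q[k]) for k in joint) / n
    return R, D

# Example: i.i.d. Bernoulli(1/2) source, normalized Hamming distortion, n = 2.
n = 2
seqs = list(itertools.product((0, 1), repeat=n))
p_src = {x: 0.25 for x in seqs}
hamming = lambda x, xh: sum(a != b for a, b in zip(x, xh)) / n
print(baa_feed_forward(p_src, hamming, n, lam=4.0))
```

For this i.i.d. example the uniform starting point is already a fixed point of the iteration, so the sketch stops after one pass with $D_k = 0.2$ and a rate equal to $1 - H_b(0.2)$.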

Now, let us present a special case and a few extensions for Algorithm 1.

(1) Regular BAA, i.e., the delay $s = n$. For delay $s = n$, the algorithm suggested here coincides with the original BAA, where instead of step (d) we have

$$r^k(\hat{x}^n|x^n) = \frac{q^{k-1}(\hat{x}^n)\,2^{-\lambda d(x^n,\hat{x}^n)}}{\sum_{\hat{x}'^n} q^{k-1}(\hat{x}'^n)\,2^{-\lambda d(x^n,\hat{x}'^n)}},$$

and in step (e), $q^k(\hat{x}^n)$ corresponds to the joint probability $p(x^n)\,r^k(\hat{x}^n|x^n)$ as well. Moreover, the expression for $c^k_{\hat{x}^n}$ is reduced to

$$c^k_{\hat{x}^n} = \frac{q^k(\hat{x}^n)}{q^{k-1}(\hat{x}^n)},$$

and the termination of the algorithm in step (g) is defined by

$$F = \log\max_{\hat{x}^n} c^k_{\hat{x}^n} - \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log c^k_{\hat{x}^n} < \epsilon,$$

as in the regular Blahut-Arimoto algorithm [16].

(2) Function of the feed-forward with general delay $s$. We present a generalization of the algorithm, where the feed-forward is a deterministic function of the source with some delay $s$, $z^{i-s} = f(x^{i-s})$. In that case, step (d) is replaced by

$$r^k(\hat{x}^n|x^n) = \frac{q^{k-1}(\hat{x}^n\|z^{n-s})\,2^{-\lambda d(x^n,\hat{x}^n)}}{\sum_{\hat{x}'^n} q^{k-1}(\hat{x}'^n\|z^{n-s})\,2^{-\lambda d(x^n,\hat{x}'^n)}},$$

and in step (e) we have

$$q^k(\hat{x}^n\|z^{n-s}) = \prod_{i=1}^{n} p(\hat{x}_i|\hat{x}^{i-1},z^{i-s}),$$

where we calculate $p(\hat{x}_i|\hat{x}^{i-1},z^{i-s})$ from the joint distribution $p(x^n,\hat{x}^n) = p(x^n)\,r^k(\hat{x}^n|x^n)$. The algorithm is terminated in the same way, where

$$c^k_{\hat{x}^n,z^{n-s}} = \frac{q^k(\hat{x}^n\|z^{n-s})}{q^{k-1}(\hat{x}^n\|z^{n-s})}.$$



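B. Complexity and Memory needed

The computational complexity and the memory needed for the algorithm above are presented in Table I.

TABLE I: Memory and operations needed for the regular BAA and for the extended BAA for source coding with feed-forward.

Problem | Operations | Memory
$\min_{p(\hat{x}^n|x^n):\ E[d(X^n,\hat{X}^n)] \le D} \frac{1}{n} I(X^n;\hat{X}^n)$, regular BAA | $O\big((|\mathcal{X}||\hat{\mathcal{X}}|)^n\big)$ | $(|\mathcal{X}||\hat{\mathcal{X}}|)^n + |\mathcal{X}|^n + |\hat{\mathcal{X}}|^n$
$\min_{p(\hat{x}^n|x^n):\ E[d(X^n,\hat{X}^n)] \le D} \frac{1}{n} I(\hat{X}^n \to X^n)$, Alg. 1 | $O\big((|\mathcal{X}||\hat{\mathcal{X}}|)^n\big)$ | $2(|\mathcal{X}||\hat{\mathcal{X}}|)^n + |\mathcal{X}|^n$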



VII. Derivation of Algorithm 1

In this section, we first describe the alternating minimization procedure, and then prove that, as given in Alg. 1, it converges to the global minimum given by

$$R_n(D) = \frac{1}{n}\min_{r(\hat{x}^n|x^n):\ E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n).$$

Throughout this section, note that the input probability $p(x^n)$ is fixed in all minimization calculations. Further, we denote by $I_{FF}(r,q)$ the directed information, given by

$$I_{FF}(r,q) = \sum_{x^n,\hat{x}^n} p(x^n)\,r(\hat{x}^n|x^n)\log\frac{r(\hat{x}^n|x^n)}{q(\hat{x}^n\|x^{n-1})}.$$

The alternating maximization procedure is described in [18] by two maximization functions: $c_2(u_1) \in A_2$, which is the point that achieves $\sup_{u_2 \in A_2} f(u_1,u_2)$, and $c_1(u_2) \in A_1$, which is the one that achieves $\sup_{u_1 \in A_1} f(u_1,u_2)$. Although in this paper we wish to solve a minimization problem, its negative can be used in the alternating maximization procedure. We now state the alternating maximization procedure lemma.

Lemma 5 (Lemmas 9.4, 9.5 in [18], "Convergence of the alternating maximization procedure"). Let $f(u_1,u_2)$ be a real, concave, bounded from above function, that is continuous and has continuous partial derivatives, and let the sets $A_1$, $A_2$ over which we maximize be convex. Further, assume that $c_2(u_1) \in A_2$ and $c_1(u_2) \in A_1$ for all $u_1 \in A_1$, $u_2 \in A_2$. Let us define an iteration as the following equation

$$(u_1^k,u_2^k) = \left(c_1(u_2^{k-1}),\ c_2\big(c_1(u_2^{k-1})\big)\right),$$

and in each iteration we consider the value $f^k = f(u_1^k,u_2^k)$. Under these conditions, $\lim_{k\to\infty} f^k = f^*$, where $f^*$ is the solution to the optimization problem.
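The iteration in Lemma 5 can be illustrated on a toy problem. The sketch below is only an illustration: the concave quadratic `f`, its coordinate-wise maximizers `c1` and `c2` (obtained by setting each partial derivative to zero), and the choice of the whole real line for both sets are our assumptions and do not come from the paper.

```python
def f(u1, u2):
    """A strictly concave quadratic, bounded from above (toy objective)."""
    return -(u1 - 1.0) ** 2 - (u2 - 2.0) ** 2 - u1 * u2

def c1(u2):
    """argmax over u1 of f(u1, u2): solves -2(u1 - 1) - u2 = 0."""
    return 1.0 - u2 / 2.0

def c2(u1):
    """argmax over u2 of f(u1, u2): solves -2(u2 - 2) - u1 = 0."""
    return 2.0 - u1 / 2.0

def alternate(u2, iters=60):
    """Run (u1^k, u2^k) = (c1(u2^{k-1}), c2(c1(u2^{k-1}))) for a fixed number of steps."""
    for _ in range(iters):
        u1 = c1(u2)
        u2 = c2(u1)
    return u1, u2, f(u1, u2)

print(alternate(10.0))
```

Here the stationary point is $(u_1,u_2) = (0,2)$ with $f^* = -1$, and each pass shrinks the error in $u_2$ by a factor of four, so the alternating iterates approach the joint maximum from any starting point.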

The rate-distortion function with feed-forward can be, as in [16], carried out parametrically in terms of the parameter $\lambda$, which is introduced as a Lagrange multiplier. In App. D we show that this parameter defines the slope of the curve $R_n(D)$ at the point it parameterizes, and the slope is given by $-\frac{\lambda}{n}$. We now write the following parametric expression for $R_n(D)$.

$$R_n(D) = \frac{1}{n}\min_{r(\hat{x}^n|x^n)}\left[I(\hat{X}^n \to X^n) + \lambda\left(E_r\left[d(X^n,\hat{X}^n)\right] - D\right)\right], \tag{26}$$

where $D$ is the distortion at the point $r^*(\hat{x}^n|x^n)$ that achieves $R_n(D)$. Here, the value of $D$ is not an input to the minimization, but is determined by the parameter $\lambda$.

Note that the directed information is a function of the joint distribution $p(x^n)\,r(\hat{x}^n|x^n)$. Since the source distribution is given, the directed information $I_{FF}$ is determined by $r = r(\hat{x}^n|x^n)$ alone. Let us denote by $q = q(\hat{x}^n\|x^{n-1})$ the causal conditioning probability. Now, let us define the functional

$$K(r,q) = I_{FF}(r,q) + \lambda E_r\left[d(X^n,\hat{X}^n)\right]. \tag{27}$$

From (26) and (27) we can see that $R_n(D)$ can be written as

$$R_n(D) = \frac{1}{n}\left[-\lambda D + \min_{r} K(r,q)\right],$$

where $q(\hat{x}^n\|x^{n-1})$ corresponds to the joint distribution $p(x^n)\,r(\hat{x}^n|x^n)$, and $D$ is the distortion at the point $r^*(\hat{x}^n|x^n)$ that achieves $R_n(D)$.

In this section, we show that we can use the alternating minimization procedure for computing $R_n(D)$. For this purpose, we present several lemmas that assist in proving our main goal. In Lemma 6 we show that the expression we minimize satisfies the conditions in Lemma 5. In Lemma 7 we show that we are allowed to minimize the functional $K$ over $r(\hat{x}^n|x^n)$ and $q(\hat{x}^n\|x^{n-1})$ together, rather than over $r(\hat{x}^n|x^n)$ alone, and thus use the alternating minimization procedure to achieve the optimum value. Lemma 8 is a supplementary claim that helps us to prove Lemma 7, in which we find an expression for $q(\hat{x}^n\|x^{n-1})$ that minimizes the functional $K$ where $r(\hat{x}^n|x^n)$ is fixed. In Lemma 9 we find an explicit expression for $r(\hat{x}^n|x^n)$ that minimizes the functional $K$ where $q(\hat{x}^n\|x^{n-1})$ is fixed. Theorem 4 combines all lemmas to show that the alternating minimization procedure, as described in Alg. 1, converges. We end with a supplementary claim about the upper and lower bounds to the rate distortion, and then prove that the stopping condition described in Alg. 1 ensures that the error $|R_n^k(D) - R_n(D)| < \epsilon$. From here on, we denote the probabilities over which we minimize as $r = r(\hat{x}^n|x^n)$, $q = q(\hat{x}^n\|x^{n-1})$.

Lemma 6 For a fixed input PMF $p(x^n)$, the functional $K$ given in (27), as a function of $\{r,q\}$, is convex in $\{r,q\}$, continuous and with continuous partial derivatives. Moreover, the sets of probabilities $r$, $q$ (denoted by $A_1$, $A_2$) over which we optimize are convex.



Proof: Since the functional $K$ consists of a linear (and thus convex) expression in $r$, i.e., $E_r\left[d(X^n,\hat{X}^n)\right]$, we only need to verify that the directed information is convex. We first write the directed information in the following form

$$\begin{aligned}
I(\hat{X}^n \to X^n) &= -\sum_{x^n,\hat{x}^n} p(x^n,\hat{x}^n)\log\frac{p(x^n)}{p(x^n\|\hat{x}^n)} \\
&= -\sum_{x^n,\hat{x}^n} p(x^n,\hat{x}^n)\log\frac{p(x^n)\,q(\hat{x}^n\|x^{n-1})}{p(x^n\|\hat{x}^n)\,q(\hat{x}^n\|x^{n-1})} \\
&= -\sum_{x^n,\hat{x}^n} p(x^n,\hat{x}^n)\log\frac{q(\hat{x}^n\|x^{n-1})}{p(x^n,\hat{x}^n)/p(x^n)} \\
&= -\sum_{x^n,\hat{x}^n} p(x^n)\,r(\hat{x}^n|x^n)\log\frac{q(\hat{x}^n\|x^{n-1})}{r(\hat{x}^n|x^n)} \\
&= I_{FF}(r,q).
\end{aligned}$$

This form is the negative of a concave function, as proven in [12, Lemma 2]. Furthermore, in the same lemma we show that the directed information is continuous with continuous partial derivatives; the same explanation applies here. It is also simple to verify that both sets we minimize over are convex, i.e., the sets $A_1$, $A_2$, where

$$\begin{aligned}
A_1 &= \{r(\hat{x}^n|x^n) : r(\hat{x}^n|x^n) \ge 0 \text{ is a regular conditioned PMF}\}, \\
A_2 &= \{q(\hat{x}^n\|x^{n-1}) : q(\hat{x}^n\|x^{n-1}) \text{ is a causally conditioned PMF}\}.
\end{aligned} \tag{28}$$



Recall that in order to use the alternating minimization procedure we minimize over $\{r(\hat{x}^n|x^n), q(\hat{x}^n\|x^{n-1})\}$ instead of over $r(\hat{x}^n|x^n)$ alone, and thus need the following lemma.

Lemma 7 For any discrete random variables $X^n$, $\hat{X}^n$, the following holds

$$R_n(D) = \frac{1}{n}\left(-\lambda D + \min_{r}\min_{q} K(r,q)\right),$$

where $D$ is the distortion at the point $r^*(\hat{x}^n|x^n)$ that achieves $R_n(D)$.

To prove this lemma, we note that $E_r\left[d(X^n,\hat{X}^n)\right]$, which does not contain the variable $q$, is part of the functional $K$. Hence, it suffices to show that

$$\min_{r(\hat{x}^n|x^n)} \frac{1}{n} I(\hat{X}^n \to X^n) = \min_{q(\hat{x}^n\|x^{n-1})}\ \min_{r(\hat{x}^n|x^n)} \frac{1}{n} I(\hat{X}^n \to X^n). \tag{29}$$

The proof is given after the following supplementary claim, in which we calculate the specific $q(\hat{x}^n\|x^{n-1})$ that minimizes the directed information when $r(\hat{x}^n|x^n)$ is fixed.

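Lemma 8 For fixed $r(\hat{x}^n|x^n)$, there exists a unique $c_2(r)$ that achieves $\min_{q(\hat{x}^n\|x^{n-1})} I(\hat{X}^n \to X^n)$, and is given by

$$q^*(\hat{x}^n\|x^{n-1}) = \frac{p(x^n)\,r(\hat{x}^n|x^n)}{p(x^n\|\hat{x}^n)}, \tag{30}$$

where $p(x^n\|\hat{x}^n)$ is calculated using the joint distribution $p(x^n)\,r(\hat{x}^n|x^n)$.

Proof of Lemma 8:

$$\begin{aligned}
I_{FF}(r,q) - I_{FF}(r,q^*) &= \sum_{x^n,\hat{x}^n} p(x^n)\,r(\hat{x}^n|x^n)\log\frac{q^*(\hat{x}^n\|x^{n-1})}{q(\hat{x}^n\|x^{n-1})} \\
&= \sum_{x^n,\hat{x}^n} p(x^n,\hat{x}^n)\log\frac{p(x^n\|\hat{x}^n)\,q^*(\hat{x}^n\|x^{n-1})}{p(x^n\|\hat{x}^n)\,q(\hat{x}^n\|x^{n-1})} \\
&= D\left(p(x^n\|\hat{x}^n)\,q^*(\hat{x}^n\|x^{n-1})\ \big\|\ p(x^n\|\hat{x}^n)\,q(\hat{x}^n\|x^{n-1})\right) \\
&\overset{(a)}{\ge} 0,
\end{aligned}$$

where (a) follows from the non-negativity of the divergence. Equality holds if and only if the joint PMFs are the same, i.e., $q = q^*$. ■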
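The identity behind this proof can be checked numerically in the simplest case. The sketch below is an illustration only, for block length $n = 1$, where $q(\hat{x}^1\|x^0)$ reduces to an ordinary PMF on $\hat{x}$ and the minimizer is the $\hat{X}$-marginal of the joint distribution; the numerical values and the function names are assumptions made for the example.

```python
import math

def i_ff(p, r, q):
    """sum_{x,xh} p(x) r(xh|x) log2( r(xh|x) / q(xh) ) for block length n = 1."""
    return sum(p[x] * r[x][xh] * math.log2(r[x][xh] / q[xh])
               for x in range(len(p)) for xh in range(len(q)) if r[x][xh] > 0)

def q_star(p, r):
    """Minimizing q for n = 1: the marginal of xh under the joint p(x) r(xh|x)."""
    return [sum(p[x] * r[x][xh] for x in range(len(p))) for xh in range(len(r[0]))]

def kl(a, b):
    """Divergence D(a || b) in bits."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

p = [0.4, 0.6]                      # source PMF (example values)
r = [[0.9, 0.1], [0.3, 0.7]]        # test channel r(xh|x) (example values)
q = [0.5, 0.5]                      # an arbitrary competing q
qs = q_star(p, r)
print(i_ff(p, r, q) - i_ff(p, r, qs), kl(qs, q))
```

The two printed numbers coincide: the excess of $I_{FF}(r,q)$ over $I_{FF}(r,q^*)$ is exactly a divergence, hence non-negative and zero only at $q = q^*$.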

Proof of Lemma 7: The PMF $q$ that minimizes the directed information is the one that corresponds to the joint distribution $r(\hat{x}^n|x^n)\,p(x^n)$; thus (29) holds, and the functional $K$ can be minimized over both $r$, $q$ combined. ■

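In the following lemma, we derive an explicit expression for $r(\hat{x}^n|x^n)$ that achieves $R_n(D)$, where $q(\hat{x}^n\|x^{n-1})$ is fixed.

Lemma 9 For fixed $q(\hat{x}^n\|x^{n-1})$, there exists $c_1(q)$ that achieves $R_n(D)$, and is given by

$$r(\hat{x}^n|x^n) = \frac{q(\hat{x}^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}^n)}}{\sum_{\hat{x}'^n} q(\hat{x}'^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}'^n)}}. \tag{31}$$

Proof: Following [14, Ch. 5.5.3], since we are solving a convex optimization problem, we can apply the KKT conditions with the constraints $\sum_{\hat{x}^n} r(\hat{x}^n|x^n) = 1$, and set up the functional

$$J = \sum_{x^n,\hat{x}^n} p(x^n)\,r(\hat{x}^n|x^n)\log\frac{r(\hat{x}^n|x^n)}{q(\hat{x}^n\|x^{n-1})} + \lambda\left(\sum_{x^n,\hat{x}^n} p(x^n)\,r(\hat{x}^n|x^n)\,d(x^n,\hat{x}^n) - D\right) + \sum_{x^n}\nu(x^n)\sum_{\hat{x}^n} r(\hat{x}^n|x^n).$$

Solving $\frac{\partial J}{\partial r(\hat{x}^n|x^n)} = 0$ yields the expression for $r(\hat{x}^n|x^n)$ as

$$r(\hat{x}^n|x^n) = \frac{q(\hat{x}^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}^n)}}{\sum_{\hat{x}'^n} q(\hat{x}'^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}'^n)}}.$$

■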
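For completeness, the differentiation step can be written out as follows. This is a sketch under the assumptions that all logarithms are base 2, that $p(x^n) > 0$, and that $q$ is held fixed; the symbol $\nu(x^n)$ is the multiplier of the normalization constraint in $J$.

```latex
\begin{align*}
\frac{\partial J}{\partial r(\hat{x}^n|x^n)}
  &= p(x^n)\left[\log\frac{r(\hat{x}^n|x^n)}{q(\hat{x}^n\|x^{n-1})} + \log e\right]
     + \lambda\, p(x^n)\, d(x^n,\hat{x}^n) + \nu(x^n) = 0 \\
\Longrightarrow\quad
r(\hat{x}^n|x^n)
  &= q(\hat{x}^n\|x^{n-1})\, 2^{-\lambda d(x^n,\hat{x}^n)}
     \cdot 2^{-\log e - \nu(x^n)/p(x^n)} .
\end{align*}
% The last factor depends on x^n only, so the constraint
% \sum_{\hat{x}^n} r(\hat{x}^n|x^n) = 1 fixes it to
% 1 / \sum_{\hat{x}'^n} q(\hat{x}'^n\|x^{n-1}) 2^{-\lambda d(x^n,\hat{x}'^n)} .
```

Substituting the normalizing factor back gives the expression stated in Lemma 9, which is also step (d) of Alg. 1.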



Another lemma is required, stating that the algorithm, once it converges, remains fixed on its variables. We already know that the variable $q$ that optimizes the directed information is unique; we have to show that within the algorithm, the variable $r$ is unique as well.

Lemma 10 Using the iterations in Alg. 1, the variable $r$ is unique, and does not change if convergence is achieved.

Proof: The uniqueness is proven in a similar way to a proof given by Blahut in [16, Theorem 6], and we follow it with appropriate modifications. We recall that in the $k$th iteration,

$$K(r^k,q^k) = I_{FF}(r^k,q^k) + \lambda E_{r^k}\left[d(X^n,\hat{X}^n)\right] = \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log\frac{r^k(\hat{x}^n|x^n)}{q^k(\hat{x}^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}^n)}}.$$

Further, from [16, Theorem 6] we can see that

$$K(r^{k+1},q^{k+1}) = -\sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log\left(\sum_{\hat{x}'^n} q^k(\hat{x}'^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}'^n)}\right) + \sum_{x^n,\hat{x}^n} p(x^n)\,r^{k+1}(\hat{x}^n|x^n)\log\frac{q^k(\hat{x}^n\|x^{n-1})}{q^{k+1}(\hat{x}^n\|x^{n-1})}.$$

Hence,

$$\begin{aligned}
K(r^k,q^k) - K(r^{k+1},q^{k+1})
&= \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log\frac{r^k(\hat{x}^n|x^n)\sum_{\hat{x}'^n} q^k(\hat{x}'^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}'^n)}}{q^k(\hat{x}^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}^n)}} \\
&\quad + \sum_{x^n,\hat{x}^n} p(x^n)\,r^{k+1}(\hat{x}^n|x^n)\log\frac{q^{k+1}(\hat{x}^n\|x^{n-1})}{q^k(\hat{x}^n\|x^{n-1})} \\
&\overset{(a)}{\ge} \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\left(1 - \frac{q^k(\hat{x}^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}^n)}}{r^k(\hat{x}^n|x^n)\sum_{\hat{x}'^n} q^k(\hat{x}'^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}'^n)}}\right) \\
&\quad + \sum_{x^n,\hat{x}^n} p(x^n)\,r^{k+1}(\hat{x}^n|x^n)\left(1 - \frac{q^k(\hat{x}^n\|x^{n-1})}{q^{k+1}(\hat{x}^n\|x^{n-1})}\right) \\
&\overset{(b)}{=} \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\left(1 - \frac{r^{k+1}(\hat{x}^n|x^n)}{r^k(\hat{x}^n|x^n)}\right) + \sum_{x^n,\hat{x}^n} p(x^n\|\hat{x}^n)\,q^{k+1}(\hat{x}^n\|x^{n-1})\left(1 - \frac{q^k(\hat{x}^n\|x^{n-1})}{q^{k+1}(\hat{x}^n\|x^{n-1})}\right) \\
&= 0 + 0,
\end{aligned}$$

where (a) follows from the inequality $\log(y) \ge 1 - \frac{1}{y}$, and (b) follows from Equation (31), where $q = q^k$, $r = r^{k+1}$. Note that we have strict inequality unless $q^k = q^{k+1}$, $r^k = r^{k+1}$. Thus, $K(r^k,q^k)$ is non-increasing and is strictly decreasing unless the distribution stabilizes, and hence the uniqueness of the optimum parameter $r^*$ emerges. ■

Now, we can prove Theorem 4 as stated in Section II.

Proof of Theorem 4: First, we have to show the existence of a double minimization problem, i.e., an equivalent problem where we minimize over two variables instead of only one; this was shown in Lemma 7. Now, in order for the alternating minimization procedure to work on this optimization problem, we need to show that the conditions given in Lemma 5 are satisfied for the functional $K$; this was shown in Lemma 6. The steps described in Alg. 1 are proved in Lemmas 8 and 9, thus giving us an algorithm to compute $R_n(D)$, where the minimization is evaluated according to the parameter $\lambda$. ■

Our last step in proving the convergence of Alg. 1 is to show why the stopping condition ensures a small error. For this reason we state a lemma introducing the existence of bounds to the rate distortion with feed-forward function, and then conclude that the stopping condition does ensure a small error in the algorithm, i.e., $|R_n^k(D_k) - R_n(D_k)| < \epsilon$, where $R_n^k(D_k)$ is the upper bound in the $k$th iteration, and $D_k = E_{r^k}\left[d(X^n,\hat{X}^n)\right]$. For this purpose, we define the following expressions in each iteration,

$$c^k_{\hat{x}^n,x^{n-1}} = \frac{q^k(\hat{x}^n\|x^{n-1})}{q^{k-1}(\hat{x}^n\|x^{n-1})}, \qquad \gamma^k(x^n) = \left(\sum_{\hat{x}^n} q^{k-1}(\hat{x}^n\|x^{n-1})\,2^{-\lambda d(x^n,\hat{x}^n)}\right)^{-1}. \tag{32}$$



Lemma 11 Let the parameter $\lambda \ge 0$ be given, and let $c^k_{\hat{x}^n,x^{n-1}}$, $\gamma^k(x^n)$ be as in (32). Then, at the point $D_k = E_{r^k}\left[d(X^n,\hat{X}^n)\right]$ in the $k$th iteration of Alg. 1, we have the following bounds.

$$I_L^k(D_k) \le R_n(D_k) \le I_U^k(D_k),$$

where

$$\begin{aligned}
I_U^k(D_k) &= \frac{1}{n}\left(-\lambda D_k + \sum_{x^n} p(x^n)\log\gamma^k(x^n) - \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log c^k_{\hat{x}^n,x^{n-1}}\right), \\
I_L^k(D_k) &= \frac{1}{n}\left(-\lambda D_k + \sum_{x^n} p(x^n)\log\gamma^k(x^n) - \log\max_{\hat{x}^n,x^{n-1}} c^k_{\hat{x}^n,x^{n-1}}\right).
\end{aligned} \tag{33}$$

Note that $R_n^k(D_k) = I_U^k(D_k)$.

The proof of Lemma 11 is given in App. D.

From Lemma 11 we can conclude the following claim.

Corollary 1 Let us define the error in the algorithm as $|R_n^k(D) - R_n(D)|$. The error defined here is smaller than $\epsilon$ if the following inequality is satisfied:

$$F = \log\max_{\hat{x}^n,x^{n-1}} c^k_{\hat{x}^n,x^{n-1}} - \sum_{x^n,\hat{x}^n} p(x^n)\,r^k(\hat{x}^n|x^n)\log c^k_{\hat{x}^n,x^{n-1}} < \epsilon,$$

where $c^k_{\hat{x}^n,x^{n-1}}$ is defined in the $k$th iteration by Equation (32).

Proof: The proof follows from Equation (33), in which the upper bound and the lower bound differ only in their last expression. Thus, if $F < \epsilon$, then $R_n(D)$ is close to the upper bound $R_n^k(D)$ by, at most, $\epsilon$. ■



VIII. Numerical Examples 

In this section we present several examples for the rate distortion source coding with feed-forward. First, by using Alg. 1 we demonstrate, for a specific example, that feed-forward does not decrease the rate distortion function where the source is memoryless (i.i.d.), as shown in [3]. Then we provide two explicit examples for a Markovian source: one where the distortion is single letter, and one with a general distortion function as presented in [5]. Geometric programming is used as well, to verify our results.

In all of the examples, we run Alg. 1 with various values of $\lambda$, and thus construct the graph of $R_n(D)$ using interpolation. Alternatively, one can use the geometric programming form and find, for every distortion $D$ given as input, the rate $R$.



A. A memoryless (i.i.d.) source 

Analogous to the memoryless channel, it was shown by Weissman and Merhav [3] that for an i.i.d. source feed-forward does not decrease the rate distortion function. In this example, the source is distributed $X \sim B(\frac{1}{2})$, and the distortion function is single letter, i.e.,

$$d(x^n,\hat{x}^n) = \frac{1}{n}\sum_{i=1}^{n} d(x_i,\hat{x}_i).$$

Running our algorithm with delay $s = 1$ and block length $n = 5$, we would expect to obtain the same result as with no feed-forward at all (as shown in [19, Ch. 10.3.1]), which is given by

$$R(D) = \begin{cases} H_b(p) - H_b(D), & 0 \le D \le \min\{p, 1-p\}, \\ 0, & D > \min\{p, 1-p\}. \end{cases} \tag{34}$$

Note that $H_b(p)$, $H_b(D)$ are the binary entropies with parameters $p$, $D$, respectively. Indeed, the function above and the performance of Alg. 1 coincide, as illustrated in Fig. 4. Note that the joint distribution $p(x^n)\,r(\hat{x}^n|x^n)$ is the same as the one that achieves the analytical calculation, in which $p(\hat{x}_i) = 0.5$, and $X \oplus \hat{X} \sim B(D)$. For $D = 0.2$ and $n = 3$, solving the geometric programming form using a Matlab code produces the rate $R = 0.278072$, which is close to $R(0.2)$ using Equation (34). The value of $\lambda$ turns out to be 6, which means that the slope at the point $(R = 0.278072, D = 0.2)$ is $-\frac{\lambda}{n} = -2$.

Fig. 4: Rate distortion function for a binary source, and feed-forward with delay 1. The circles represent the performance of Alg. 1; the regular line is the plot of (34).
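The reported operating point can be cross-checked directly against (34). The short Python sketch below is a hedged illustration with hypothetical function names; it evaluates $R(0.2)$ for $p = \frac{1}{2}$, the slope of (34) obtained by differentiating $-H_b(D)$, and the corresponding $\lambda = -n \cdot \text{slope}$ for $n = 3$.

```python
import math

def hb(t):
    """Binary entropy in bits."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def rate_iid(D, p=0.5):
    """Equation (34) for a Bernoulli(p) source with Hamming distortion."""
    return hb(p) - hb(D) if D <= min(p, 1 - p) else 0.0

def slope_iid(D):
    """dR/dD on 0 < D < min(p, 1-p): derivative of -H_b(D), in bits."""
    return -math.log2((1 - D) / D)

n = 3
print(rate_iid(0.2), slope_iid(0.2), -n * slope_iid(0.2))
```

The values agree with the text: a rate of about 0.278072, a slope of $-2$, and $\lambda = 6$ for $n = 3$.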
In the following example, we present the performance of Alg. 1 for a Markov source and a single letter distortion.

B. Markov source and single letter distortion 

The Markov source is presented in Fig. 5. This model was solved by Weissman and Merhav in [3] for the symmetrical case $p = q$. We extend this model for the case of general transition probabilities $p$, $q$. The analytical solution for this example is detailed in App. E; there we show that for any $n$

$$R_n(D) = \frac{1}{n}H_b(\pi_1) + \frac{n-1}{n}\big(\pi_1 H_b(p) + \pi_2 H_b(q)\big) - H_b(D). \tag{35}$$

Fig. 5: A symmetrical Markov chain.

By taking $n$ to infinity, we have

$$R(D) = \pi_1 H_b(p) + \pi_2 H_b(q) - H_b(D),$$

where $\pi = [\pi_1,\pi_2]$ is the stationary distribution of the source. In Fig. 6(a) we present the graphs of $R_n(D)$ for $n = 1$ up to $n = 12$, where $p = 0.3$, $q = 0.2$, and $X_0$ has the stationary distribution $[0.4, 0.6]$. It is evident that $R_n(D)$ decreases as $n$ increases and converges to the analytical calculation.
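The behavior of (35) in $n$ can be reproduced with a few lines of Python. This is a sketch with hypothetical function names; the stationary distribution is computed as $\pi_1 = \frac{q}{p+q}$, which is our assumption (it reproduces the stated $[0.4, 0.6]$ for $p = 0.3$, $q = 0.2$), and the range of $D$ over which (35) is valid is not checked here.

```python
import math

def hb(t):
    """Binary entropy in bits."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def stationary(p, q):
    """Assumed two-state chain: state 1 is left w.p. p, state 2 w.p. q."""
    return q / (p + q), p / (p + q)

def rate_markov_n(D, n, p=0.3, q=0.2):
    """Equation (35): n-th order rate for the two-state Markov source."""
    pi1, pi2 = stationary(p, q)
    cond = pi1 * hb(p) + pi2 * hb(q)
    return hb(pi1) / n + (n - 1) / n * cond - hb(D)

def rate_markov_limit(D, p=0.3, q=0.2):
    """Limit of (35) as n grows."""
    pi1, pi2 = stationary(p, q)
    return pi1 * hb(p) + pi2 * hb(q) - hb(D)

for n in (1, 2, 4, 8, 12):
    print(n, rate_markov_n(0.1, n))
print("limit", rate_markov_limit(0.1))
```

Because $H_b(\pi_1)$ exceeds $\pi_1 H_b(p) + \pi_2 H_b(q)$ for these parameters, the printed values decrease monotonically in $n$ toward the limit, at a rate proportional to $\frac{1}{n}$.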

In [12, Lemma 6] we provided another estimator for the feedback channel capacities, namely, the directed information rate. There, we show that if the limit exists, then

$$\lim_{n\to\infty}\frac{1}{n} I(X^n \to Y^n) = \lim_{n\to\infty}\left(I(X^n \to Y^n) - I(X^{n-1} \to Y^{n-1})\right).$$

We can also use the directed information rate to estimate $R_n(D)$. This is applied in two ways: either when the rate value is fixed or when the distortion value is fixed. In both cases we first have to fix an axis vector and interpolate the other vector with respect to the fixed one; then we can calculate differences between the interpolated vectors.

In Fig. 6(b) we present this estimator only for $n = 12$, where the vector of the distortion is interpolated, i.e., $12D_{12}(R) - 11D_{11}(R)$. We can see that this estimation is much more accurate than the one in Fig. 6(a).
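To see why a difference-based estimator can be so accurate for this source, consider the fixed-distortion variant of the estimator applied to the closed form (35). The following worked equation is a sketch that assumes (35) holds at the distortion $D$ of interest for both block lengths.

```latex
\begin{align*}
n R_n(D) - (n-1) R_{n-1}(D)
 &= \Big[H_b(\pi_1) + (n-1)\big(\pi_1 H_b(p) + \pi_2 H_b(q)\big) - n H_b(D)\Big] \\
 &\quad - \Big[H_b(\pi_1) + (n-2)\big(\pi_1 H_b(p) + \pi_2 H_b(q)\big) - (n-1) H_b(D)\Big] \\
 &= \pi_1 H_b(p) + \pi_2 H_b(q) - H_b(D) \;=\; R(D).
\end{align*}
```

The $\frac{1}{n}H_b(\pi_1)$ transient, which is what separates $R_n(D)$ from its limit, cancels in the difference, so under (35) the difference equals the limiting value for every $n \ge 2$.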




Fig. 6: $R(D)$ for the Markov source example and feed-forward with delay 1.
(a) Graph of $R_n(D)$ versus $D$; the arrow marks the way $R_n(D)$ responds to $n$ increasing. The dashed line is the analytical calculation.
(b) Graph of $12D_{12}(R) - 11D_{11}(R)$ versus $D$. The circles represent the performance of Alg. 1.



This is a good opportunity to present the performance of the upper and lower bounds to a specific rate distortion pair $(R,D)$, and the geometric programming solution to this problem. We ran our BA-type algorithm for the specific parameters $\lambda = 9.216$, $n = 3$, which correspond to the rate distortion pair $(R = 0.35884, D = 0.10627)$ at slope $-\frac{9.216}{3}$; this is presented in Fig. 7(a). We also ran ten distortion points using GP from $D = 0$ to $D = 0.27$ and compared the result to $R_3(D)$ as in (35) and to the BAA performance; the solution is in Fig. 7(b).

Fig. 7: Bounds for $R_3(D)$ and performance of GP and BAA for $R_3(D)$.
(a) Graph of the upper and lower bounds as a function of the iteration for $n = 3$, $\lambda = 9.216$, as given in Equation (33).
(b) Graph of the solution using the GP and BAA methods for $n = 3$. The solid line is $R_3(D)$ as in (35), the circles represent the performance of the GP, and the dashed line is the BAA result.
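The operating point quoted above is consistent with the closed form (35). As a hedged check, the sketch below (hypothetical names, base-2 logarithms, $\pi = [0.4, 0.6]$, $p = 0.3$, $q = 0.2$ as in the text) recovers $\lambda$ from the slope of (35), using $\lambda = n\log_2\frac{1-D}{D}$, and evaluates $R_3(D)$ at $D = 0.10627$.

```python
import math

def hb(t):
    """Binary entropy in bits."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def r3(D, pi=(0.4, 0.6), p=0.3, q=0.2, n=3):
    """Equation (35) evaluated for block length n = 3."""
    return hb(pi[0]) / n + (n - 1) / n * (pi[0] * hb(p) + pi[1] * hb(q)) - hb(D)

def lam_from_distortion(D, n=3):
    """Slope of (35) is -log2((1-D)/D); the text relates it to lambda via -lambda/n."""
    return n * math.log2((1 - D) / D)

D = 0.10627
print(lam_from_distortion(D), r3(D))
```

Both numbers land close to the reported $\lambda = 9.216$ and $R = 0.35884$, up to the rounding of $D$.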



C. Stock market example. Markov source and general distortion 

The stock market example, in which we wish to observe the behavior of a particular stock over an $N$-day period, was introduced and solved in [5]. Assume the stock can take $k + 1$ values, $0 \le i \le k$, and is modeled as a $k + 1$ state Markov chain. On a given day $i$, the probability for the stock value to increase by 1 is $p_i$, to decrease by 1 is $q_i$, and to remain the same is $1 - p_i - q_i$. When the stock value is in state 0, the value cannot decrease. Similarly, when in state $k$ the value cannot increase. If an investor would like to be forewarned whenever the stock value drops, he is advised with a binary decision $\hat{X}_n$: $\hat{X}_n = 1$ if the value drops from day $n - 1$ to day $n$, and $\hat{X}_n = 0$ otherwise. The distortion is modeled in the following form

$$d(x^n,\hat{x}^n) = \frac{1}{n}\sum_{i=1}^{n} e(\hat{x}_i, x_{i-1}, x_i),$$

where $e(\cdot,\cdot,\cdot)$ is given in Table II.

TABLE II: Distortion $e(\hat{x}_i, x_{i-1}, x_i)$, $j \in \{0,1,\ldots,k\}$. [Table entries not recoverable from the extracted text.]

It was shown in [5] that the rate-distortion function of a general Markov-chain source with $k$ states is given by

$$R(D) = \sum_{i=1}^{k-1}\pi_i\big(H(p_i, q_i, 1 - p_i - q_i) - H_b(\epsilon)\big) + \pi_k\big(H_b(q_k) - H_b(\epsilon)\big),$$

where $\pi = [\pi_0,\pi_1,\ldots,\pi_k]$ is the stationary distribution of the Markov chain, and $\epsilon = \frac{D}{1-\pi_0}$.

In our special case we have $k = 2$, i.e., 2 states for the Markov chain, and transition probabilities $p_i = 0.3$, $q_i = 0.2$, as illustrated in Fig. 5. The stationary distribution of such a source is $\pi = [0.4, 0.6]$, and we are left with

$$R(D) = \pi_1\big(H_b(q) - H_b(\epsilon)\big) = 0.6\left(H_b(0.2) - H_b\left(\frac{D}{0.6}\right)\right).$$

Since the rate cannot be less than zero, and is a descending function of the distortion, the rate-distortion function is as above when $H_b(0.2) \ge H_b(\frac{D}{0.6})$, i.e., when $D \le 0.12$, and thus we obtain

$$R(D) = \begin{cases} 0.6\left(H_b(0.2) - H_b\left(\frac{D}{0.6}\right)\right), & D \le 0.12, \\ 0, & \text{otherwise.} \end{cases} \tag{36}$$
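Equation (36) is easy to tabulate. The following Python sketch uses hypothetical names and assumes that the entropy argument is $\frac{D}{0.6}$, which is the reading consistent with the stated threshold $D \le 0.12$; it evaluates the curve and confirms that it reaches zero exactly at the threshold.

```python
import math

def hb(t):
    """Binary entropy in bits."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def rate_stock(D, pi1=0.6, q=0.2):
    """Equation (36) for the two-state stock example (assumed argument D / pi1)."""
    threshold = pi1 * q          # 0.12 for the parameters in the text
    if D >= threshold:
        return 0.0
    return pi1 * (hb(q) - hb(D / pi1))

for D in (0.0, 0.03, 0.06, 0.09, 0.12):
    print(D, rate_stock(D))
```

At $D = 0$ the rate equals $0.6\,H_b(0.2)$, and it decreases monotonically to zero as $D$ approaches 0.12.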



In Fig. 8(a) we present the graphs of $R_n(D)$ for $n = 1$ up to $n = 12$ with the distortion described here and where $X_0$ has the stationary distribution $[0.4, 0.6]$. We can see that $R_n(D)$ decreases as $n$ increases, as expected, and converges to the analytical calculation. In Fig. 8(b) we present the directed information rate estimator only for $n = 12$, where the vector of the distortion is interpolated, i.e., $12D_{12}(R) - 11D_{11}(R)$. We can see that this estimator is much more accurate than the one in Fig. 8(a).

Fig. 8: $R(D)$ for the stock market example and feed-forward with delay 1.
(a) Graph of $R_n(D)$; the arrow marks the way $R_n(D)$ responds to $n$ increasing. The dashed line is the analytical calculation.
(b) Graph of $12D_{12}(R) - 11D_{11}(R)$. The circles represent the performance of Alg. 1.



D. The effects of the delay on R n (D) 

In this example we use the Markov source example (Fig. 5) with a single letter distortion. We run Alg. 1 with delays $s \in \{1, 2, \ldots, 10\}$ and block length $n = 10$, where $X_0$ has the stationary distribution. We expect the rate distortion function to increase with the delay $s$, because as the delay $s$ increases the value of the directed information increases as well. Due to the fact that for $s \in \{3, 4, \ldots, 10\}$ all graphs are close together, we present $R_n(D)$ only for $s = 1, 2, 10$, and the results are shown in Fig. 9.

Fig. 9: $R_{10}(D)$ for a Markov source as a function of the delay.



IX. Conclusions 



In this paper we considered the rate distortion problem of discrete-time, ergodic, and stationary sources with feed-forward at the receiver. We first derived a sequence of achievable rates, $\{R_n(D)\}_{n \ge 1}$, that converge to the feed-forward rate distortion. By showing that the sequence is sub-additive, we proved that the limit of $R_n(D)$ exists and thus equals the feed-forward rate distortion. We provided an algorithm for calculating $R_n(D)$ using the alternating minimization procedure, and also presented a dual form for the optimization of $R_n(D)$, and transformed it into a geometric programming maximization problem.



Appendix A
Proof of Lemma 2

We start by showing that the sequence $\{R_n(D)\}$ is sub-additive; the methodology is similar to Gallager's proof in [2, Th. 9.8.1] for the case of no feed-forward. Then, since the sequence $R_n(D)$ is sub-additive, following [2, Lemma 4A.2] we obtain our main objective, i.e.,

$$\lim_{n\to\infty} R_n(D) = \inf_{n} R_n(D).$$

To commence, we recall that a sequence $\{a_n\}$ is called sub-additive if for all $m$, $l$,

$$(m + l)\,a_{m+l} \le m\,a_m + l\,a_l.$$

Let $l$, $n$ be arbitrary positive integers and, for a given $D$, let $p_n(\hat{x}^n|x^n)$ and $p_l(\hat{x}^l|x^l)$ be the conditional PMFs that achieve the minimum of the directed information with block lengths $n$ and $l$, i.e., that achieve $R_n(D)$ and $R_l(D)$, respectively. Suppose we transmit $m = n + l$ samples as follows: the first $n$ samples are transmitted using $p_n$, and the subsequent $l$ samples are transmitted using $p_l$. Hence, the overall conditional PMF is

$$p_{n+l}(\hat{x}^{n+l}|x^{n+l}) = p_n(\hat{x}^n|x^n)\,p_l(\hat{x}_{n+1}^{n+l}|x_{n+1}^{n+l}).$$

We can see in Section VI that the directed information can be written as

$$I(\hat{X}^m \to X^m) = H(\hat{X}^m\|X^{m-1}) - H(\hat{X}^m|X^m).$$

From the construction of the overall conditional PMF $p_{n+l}$, it is clear that

$$H(\hat{X}^{n+l}|X^{n+l}) = H(\hat{X}^n|X^n) + H(\hat{X}_{n+1}^{n+l}|X_{n+1}^{n+l}).$$

Furthermore,

$$\begin{aligned}
H(\hat{X}^m\|X^{m-1}) &= \sum_{i=1}^{n+l} H(\hat{X}_i|\hat{X}^{i-1},X^{i-1}) \\
&= H(\hat{X}^n\|X^{n-1}) + \sum_{i=n+1}^{n+l} H(\hat{X}_i|\hat{X}^{i-1},X^{i-1}) \\
&\le H(\hat{X}^n\|X^{n-1}) + \sum_{i=n+1}^{n+l} H(\hat{X}_i|\hat{X}_{n+1}^{i-1},X_{n+1}^{i-1}) \\
&= H(\hat{X}^n\|X^{n-1}) + H(\hat{X}_{n+1}^{n+l}\|X_{n+1}^{n+l-1}).
\end{aligned}$$

Thus, it follows that

$$I(\hat{X}^{n+l} \to X^{n+l}) \le I(\hat{X}^n \to X^n) + I(\hat{X}_{n+1}^{n+l} \to X_{n+1}^{n+l}). \tag{37}$$

Since the source is stationary, we can start the input block at any given time index; thus the PMFs $p_n$ and $p_l$ achieve $nR_n(D) + lR_l(D)$ on the right-hand side of Equation (37), while the left-hand side is at least $(n + l)R_{n+l}(D)$, since the rate distortion function is obtained by minimizing this expression. Hence, we obtain

$$(n + l)R_{n+l}(D) \le nR_n(D) + lR_l(D).$$

Using [2, Lemma 4A.2] for sub-additive sequences, we obtain

$$\inf_{n} R_n(D) = \lim_{n\to\infty} R_n(D).$$

■
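As an illustration of the sub-additivity inequality, it can be verified in closed form for the two-state Markov source of Section VIII-B. The following worked equation is a sketch that assumes the expression (35) for $R_n(D)$ and writes $S = \pi_1 H_b(p) + \pi_2 H_b(q)$, which is our shorthand.

```latex
\begin{align*}
n R_n(D) + l R_l(D) - (n+l) R_{n+l}(D)
 &= \big[2 H_b(\pi_1) + (n+l-2) S - (n+l) H_b(D)\big] \\
 &\quad - \big[H_b(\pi_1) + (n+l-1) S - (n+l) H_b(D)\big] \\
 &= H_b(\pi_1) - S \;\ge\; 0 .
\end{align*}
% H_b(\pi_1) is the entropy of one source symbol and S is the conditional
% entropy of a symbol given its predecessor, so the gap is non-negative
% because conditioning reduces entropy.
```

For $p = 0.3$, $q = 0.2$ the gap is $H_b(0.4) - (0.4 H_b(0.3) + 0.6 H_b(0.2))$, which is strictly positive, in agreement with $R_n(D)$ decreasing in $n$.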
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Appendix B 
Proof of Lemma 4

In this Appendix we prove Lemma 4, which states that the mathematical expression for the feed-forward rate distortion

$$R^{(I)}(D) = \lim_{n\to\infty}\frac{1}{n}\min_{p(\hat{x}^n|x^n):\ E[d(X^n,\hat{X}^n)] \le D} I(\hat{X}^n \to X^n), \tag{38}$$

is a lower bound to the operational definition $R(D)$.

Proof: Consider any $(n, 2^{nR}, D)$ rate distortion with feed-forward code defined by the mappings $f$, $\{g_i\}_{i=1}^{n}$ as given in Section II, Equation (3), and distortion constraint $E\left[d(X^n,\hat{X}^n)\right] \le D + \epsilon_n$, where $\epsilon_n \to 0$ as $n$ goes to infinity. Let the message sent be a random variable $T = f(X^n)$, and assume that the distortion constraint is satisfied. Then we have the following chain of inequalities:

$$\begin{aligned}
nR &\overset{(a)}{\ge} H(T) \\
&\ge I(X^n;T) \\
&\overset{(b)}{=} \sum_{i=1}^{n}\left(H(X_i|X^{i-1}) - H(X_i|X^{i-1},T)\right) \\
&\overset{(c)}{=} \sum_{i=1}^{n}\left(H(X_i|X^{i-1}) - H(X_i|X^{i-1},T,\hat{X}^i)\right) \\
&\overset{(d)}{\ge} \sum_{i=1}^{n}\left(H(X_i|X^{i-1}) - H(X_i|X^{i-1},\hat{X}^i)\right) \\
&= \sum_{i=1}^{n} I(\hat{X}^i;X_i|X^{i-1}) \\
&\overset{(e)}{=} I(\hat{X}^n \to X^n),
\end{aligned}$$

where (a) follows from the fact that the alphabet of $T$ has size $2^{nR}$, (b) follows from the chain rule for mutual information, (c) is due to the fact that given $X^{i-1}$, $T$, we know $\hat{X}^i$, and (d) holds since conditioning reduces the entropy. Step (e) follows from the chain rule for directed information. Taking $n$ to infinity, we obtain $R \ge R^{(I)}(D)$, and the distortion constraint satisfies

$$\lim_{n\to\infty} E\left[d(X^n,\hat{X}^n)\right] \le D.$$

■
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Appendix C 
Proof of Theorem[6] 

In this appendix we provide a proof for Theorem |6] We recall that Theorem [6] states that the rate distortion 
function can be written as the following optimization problem: 

R n (D)= max — ( —XD + p(x n ) log ^(x 71 ) ] , (39) 

where, for some causal conditioned probability p'(x n \\x n ), r y(x n ) satisfies the inequality constraint 

p(x n )-f(x n )2- Xd( - x "'^ <p'(x n \\x n ). (40) 

We prove this theorem in two ways. One is similar to Berger's proof in lfl3l . based on the inequality log(y) > 1— i, 
for the regular rate distortion function. The other is using the Lagrange duality between the minimization problem 
we are familiar with and a maximization problem as presented in O and 1151 . We also provide the connection 
between the curve of R n (D) and the parameter A; this is embodied in Lemma [T2l 

Before we begin, we recall that a step in Alg. [T|is defined by the following equality 

„fc — 1 {~n I |„ra — 1\q — Ad(x™,x") 

r k (x n \x n ) = - [ 11 - (41) 

\* \- L J J2 rin q k ~ 1 (x' n \\x n - 1 )2- xd( . xn ' £ ' n y 



This equality is the outcome of differentiating the Lagrangian when q(x n \\ a;™ -1 ) is fixed, as given in Section [VTT1 
We shall use this equality throughout the proof. 

As mentioned, the first proof follows the one in [13].

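Proof of Theorem 6: First, we show that for every $r(\hat{x}^n | x^n)$ for which the distortion constraint is satisfied, the following chain of inequalities holds:

$$
\begin{aligned}
I_{FF}(r,q) + \lambda D - \sum_{x^n} p(x^n) \log \gamma(x^n)
&\overset{(a)}{\geq} I_{FF}(r,q) + \lambda E_{r(\hat{x}^n|x^n)}\left[ d(X^n, \hat{X}^n) \right] - \sum_{x^n} p(x^n) \log \gamma(x^n) \\
&= \sum_{x^n, \hat{x}^n} p(x^n)\, r(\hat{x}^n|x^n) \log \frac{r(\hat{x}^n|x^n)}{q(\hat{x}^n \| x^{n-1})\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)}} \\
&\overset{(b)}{\geq} \sum_{x^n, \hat{x}^n} p(x^n)\, r(\hat{x}^n|x^n) \left( 1 - \frac{q(\hat{x}^n \| x^{n-1})\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)}}{r(\hat{x}^n|x^n)} \right) \\
&= 1 - \sum_{x^n, \hat{x}^n} q(\hat{x}^n \| x^{n-1})\, p(x^n)\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)} \\
&\overset{(c)}{\geq} 1 - \sum_{x^n, \hat{x}^n} q(\hat{x}^n \| x^{n-1})\, p'(x^n \| \hat{x}^n) \\
&\overset{(d)}{=} 0,
\end{aligned}
$$

where (a) follows from the fact that the distortion $D$ is at least $E_{r(\hat{x}^n|x^n)}\left[ d(X^n, \hat{X}^n) \right]$ for every $r(\hat{x}^n|x^n)$, as has been assumed, (b) follows from the inequality $\log y \geq 1 - \frac{1}{y}$, (c) is due to the constraint in Equation (40), and (d) follows from the fact that $q(\hat{x}^n \| x^{n-1})\, p'(x^n \| \hat{x}^n)$ is equal to some joint distribution $p(x^n, \hat{x}^n)$ [6]. Since the chain of inequalities holds for every $r(\hat{x}^n|x^n)$, we can choose the one that achieves $R_n(D)$, and then divide by $n$ to obtain the inequality in Equation (39) of the theorem.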

To complete the proof of Theorem 6, we need to show that equality holds in the chain of inequalities above for some $\gamma(x^n)$ that satisfies the constraint. To this end, let us denote by $r^*(\hat{x}^n|x^n)$ the conditional PMF that achieves $R_n(D)$. Further, we denote by $q^*(\hat{x}^n \| x^{n-1})$ the corresponding causal conditioned PMF. Now, consider the following chain of equalities:

$$
\begin{aligned}
nR_n(D) &= \sum_{x^n, \hat{x}^n} p(x^n)\, r^*(\hat{x}^n|x^n) \log \frac{r^*(\hat{x}^n|x^n)}{q^*(\hat{x}^n \| x^{n-1})} \\
&\overset{(a)}{=} \sum_{x^n, \hat{x}^n} p(x^n)\, r^*(\hat{x}^n|x^n) \log \frac{2^{-\lambda d(x^n, \hat{x}^n)}}{\sum_{\hat{x}'^n} q^*(\hat{x}'^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}'^n)}} \\
&\overset{(b)}{=} -\lambda D + \sum_{x^n} p(x^n) \log \gamma(x^n),
\end{aligned}
$$

where (a) is due to a step in the algorithm given by (41), and by the uniqueness of $r^*(\hat{x}^n|x^n)$ in the algorithm, as shown in Lemma 10, and (b) follows from the expression for $\gamma(x^n)$ given by

$$
\gamma(x^n) = \left( \sum_{\hat{x}'^n} q^*(\hat{x}'^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}'^n)} \right)^{-1}. \tag{42}
$$

Therefore, we are left with verifying that the $\gamma(x^n)$ above satisfies the constraint:

$$
\begin{aligned}
p(x^n)\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)}
&= p(x^n) \frac{2^{-\lambda d(x^n, \hat{x}^n)}}{\sum_{\hat{x}'^n} q^*(\hat{x}'^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}'^n)}} \\
&\overset{(a)}{=} \frac{p(x^n)\, r^*(\hat{x}^n|x^n)}{q^*(\hat{x}^n \| x^{n-1})} \\
&= \frac{p(x^n, \hat{x}^n)}{q^*(\hat{x}^n \| x^{n-1})} \\
&\overset{(b)}{=} p'(x^n \| \hat{x}^n),
\end{aligned}
$$

where (a) follows from Equation (41), and (b) is due to the causal conditioning chain rule. Hence, we have shown that $R_n(D)$ is the solution to the optimization problem given in Equation (39). ■
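The equality between the primal value and the dual objective can be observed numerically. The following is a minimal sketch, assuming block length $n = 1$ (so that causal conditioning reduces to an ordinary output distribution), a binary source with $p = (0.3, 0.7)$, Hamming distortion, and $\lambda = 2$; these values and the function name `primal_dual_check` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def primal_dual_check(p, dist, lam, iters=500):
    """n = 1: compare I(r, q) with -lam*D + sum_x p(x) log2 gamma(x)."""
    kernel = 2.0 ** (-lam * dist)                    # 2^{-lam d(x, xh)}
    q = np.full(dist.shape[1], 1.0 / dist.shape[1])  # strictly positive start
    for _ in range(iters):                           # alternate r-step and q-step
        r = q[None, :] * kernel
        r /= r.sum(axis=1, keepdims=True)
        q = p @ r
    gamma = 1.0 / (kernel @ q)                       # (42) at the final q
    r = q[None, :] * kernel * gamma[:, None]         # (41) at the final q
    D = float((p[:, None] * r * dist).sum())         # distortion achieved by r
    q_ind = p @ r                                    # output marginal induced by r
    primal = float((p[:, None] * r * np.log2(r / q_ind[None, :])).sum())
    dual = -lam * D + float(p @ np.log2(gamma))
    backward = p[:, None] * gamma[:, None] * kernel  # left-hand side of (40)
    return primal, dual, backward

p = np.array([0.3, 0.7])
hamming = 1.0 - np.eye(2)
primal, dual, backward = primal_dual_check(p, hamming, lam=2.0)
```

After convergence, `primal` and `dual` agree up to numerical precision, and each column of `backward` sums to one over $x$, as expected of a conditional PMF of $x$ given $\hat{x}$.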

We also present an alternative proof of Theorem 6, this time using Lagrange duality, as in [14], [15].

Alternative proof of Theorem 6: Recall that $R_n(D)$ is the result of

$$
\frac{1}{n} \min_{r(\hat{x}^n|x^n)} \sum_{x^n, \hat{x}^n} p(x^n)\, r(\hat{x}^n|x^n) \log \frac{r(\hat{x}^n|x^n)}{q(\hat{x}^n \| x^{n-1})},
$$

where $q(\hat{x}^n \| x^{n-1})$ is defined by $p(x^n)\, r(\hat{x}^n|x^n)$, subject to the following conditions:

$$
\begin{aligned}
\sum_{x^n, \hat{x}^n} p(x^n)\, r(\hat{x}^n|x^n)\, d(x^n, \hat{x}^n) &\leq D, \\
\forall\, x^n: \quad \sum_{\hat{x}^n} r(\hat{x}^n|x^n) &= 1, \\
\forall\, x^n, \hat{x}^n: \quad r(\hat{x}^n|x^n) &\geq 0.
\end{aligned}
$$

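Let us define the Lagrangian as

$$
\begin{aligned}
J(r, \lambda, \gamma, \mu) ={}& \sum_{x^n, \hat{x}^n} p(x^n)\, r(\hat{x}^n|x^n) \log \frac{r(\hat{x}^n|x^n)}{q(\hat{x}^n \| x^{n-1})} + \lambda \left( \sum_{x^n, \hat{x}^n} p(x^n)\, r(\hat{x}^n|x^n)\, d(x^n, \hat{x}^n) - D \right) \\
&+ \sum_{x^n} \gamma(x^n) \left( \sum_{\hat{x}^n} r(\hat{x}^n|x^n) - 1 \right) - \sum_{x^n, \hat{x}^n} \mu(x^n, \hat{x}^n)\, r(\hat{x}^n|x^n),
\end{aligned}
$$

where $\mu(x^n, \hat{x}^n) \geq 0$ for all $x^n, \hat{x}^n$. Differentiating the Lagrangian $J(r, \lambda, \gamma, \mu)$ with respect to the variable $r(\hat{x}^n|x^n)$ and solving the equation $\frac{\partial J}{\partial r(\hat{x}^n|x^n)} = 0$ in order to find the optimum value yields the following expression:

$$
r(\hat{x}^n|x^n) = q(\hat{x}^n \| x^{n-1})\, \gamma'(x^n)\, 2^{\frac{\mu(x^n, \hat{x}^n)}{p(x^n)} - \lambda d(x^n, \hat{x}^n)}, \tag{43}
$$

where $\gamma'(x^n) = 2^{-\frac{\gamma(x^n)}{p(x^n)}}$. Multiplying both sides by $\frac{p(x^n)}{q(\hat{x}^n \| x^{n-1})}$, we are left with the constraint

$$
\begin{aligned}
p(x^n \| \hat{x}^n) &= p(x^n)\, \gamma'(x^n)\, 2^{\frac{\mu(x^n, \hat{x}^n)}{p(x^n)} - \lambda d(x^n, \hat{x}^n)} \\
&\geq p(x^n)\, \gamma'(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)},
\end{aligned} \tag{44}
$$

where $p(x^n \| \hat{x}^n)$ is induced by $r(\hat{x}^n|x^n)\, p(x^n)$.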

From [14, Chapter 5.1.3] we know that $g(\lambda, \gamma, \mu) = J(r^*, \lambda, \gamma, \mu)$ is a lower bound to $R_n(D)$. Substituting the minimizer $r(\hat{x}^n|x^n)$ using Equation (43), and the condition given by Equation (44), into $J$, we obtain the Lagrange dual function

$$
g(\lambda, \gamma') =
\begin{cases}
-\lambda D + \sum_{x^n} p(x^n) \log \gamma'(x^n), & p(x^n)\, \gamma'(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)} \leq p(x^n \| \hat{x}^n), \\
-\infty, & \text{otherwise.}
\end{cases} \tag{45}
$$

By making the constraints explicit, and since the minimization problem is convex, we obtain the Lagrange dual problem, i.e., $R_n(D)$ is the solution to

$$
\max_{\lambda, \gamma(x^n)} \frac{1}{n} \left( -\lambda D + \sum_{x^n} p(x^n) \log \gamma(x^n) \right), \tag{46}
$$

subject to

$$
\begin{aligned}
\forall\, x^n, \hat{x}^n: \quad p(x^n)\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)} &\leq p(x^n \| \hat{x}^n), \\
\lambda &\geq 0,
\end{aligned}
$$

for the $p(x^n \| \hat{x}^n)$ that is induced by $r(\hat{x}^n|x^n)\, p(x^n)$, where $r(\hat{x}^n|x^n)$ is the optimal PMF.

We say that a PMF is optimal if it achieves the optimal value. For example, the PMF $r(\hat{x}^n|x^n)$ that achieves the minimum of the directed information given the distortion constraint is optimal. We say that the PMF $p(x^n \| \hat{x}^n)$ is optimal if it is induced by the optimal $r(\hat{x}^n|x^n)$. Another example is the maximization problem in (46): we say that $\lambda, \gamma(x^n)$ are optimal if they achieve the maximum value. Therefore, $p(x^n \| \hat{x}^n)$ is optimal as well if it satisfies Equation (44).

Now, we wish to replace the constraint with

$$
\forall\, x^n, \hat{x}^n: \quad p(x^n)\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)} \leq p'(x^n \| \hat{x}^n), \tag{47}
$$

for some $p'(x^n \| \hat{x}^n)$. First, note that we always achieve equality in (47), since otherwise we can increase the value of $\gamma(x^n)$ and thus increase the objective. This, combined with the fact that for $r(\hat{x}^n|x^n) > 0$, $\mu(x^n, \hat{x}^n)$ must be zero, gives equality in (44) as well (if $r(\hat{x}^n|x^n) = 0$, then $q(\hat{x}^n \| x^{n-1}) = 0$, and Equation (43) holds too). Now, let us assume that the maximum in (46) with the constraint in (47) is achieved at a non-optimal $p'(x^n \| \hat{x}^n)$, i.e., one that is not achieved using the optimal $\lambda, \gamma(x^n)$. Thus, the value obtained in (46) is larger than the value achieved by $p(x^n \| \hat{x}^n)$, i.e., $R_n(D)$ (since the maximization includes $p(x^n \| \hat{x}^n)$). However, by Lagrange duality it should be a lower bound to $R_n(D)$, which contradicts the assumption that the maximum is achieved at a non-optimal $p'(x^n \| \hat{x}^n)$.

■

Note that we can construct the optimal PMF $r(\hat{x}^n|x^n)$ from the solution to the maximization problem presented here. Consider the parameters $\lambda, \gamma(x^n)$ that achieve (46), and calculate $p(x^n \| \hat{x}^n)$ according to Equation (44). The calculation of $r(\hat{x}^n|x^n)$ is done recursively on $r(\hat{x}^i|x^i)$. For $i = 1$, calculate $r(\hat{x}_1|x_1)$ using

Further, calculate $q(\hat{x}_1)$ using

Now, once we have $r(\hat{x}^j|x^j)$, $q(\hat{x}_j|\hat{x}^{j-1}, x^{j-1})$ for every $j < i$, calculate $r(\hat{x}^i|x^i)$ using



Xjlx-i 1 x-' 1 ) 



p{x l ) 
and then 

$$
q(\hat{x}_i | \hat{x}^{i-1}, x^{i-1}) = \frac{\sum_{x_i} p(x^i)\, r(\hat{x}^i|x^i)}{p(x^{i-1})\, r(\hat{x}^{i-1}|x^{i-1})}.
$$

Do so until i = n, and we obtain our optimal r(x n \x n ). 

Another lemma we wish to provide is the connection between the curve of $R_n(D)$ and the parameter $\lambda$. This lemma is similar to the one given by Berger in [13, Th. 2.5.1] for the case of no feed-forward.
Lemma 12: Consider the expression for $R_n(D)$ given by

$$
R_n(D) = \frac{1}{n} \left( -\lambda D + \sum_{x^n} p(x^n) \log \gamma(x^n) \right),
$$

where $\gamma(x^n)$ and $\lambda$ are the variables that maximize (46). We have seen that $\gamma(x^n)$ is of the form

$$
\gamma(x^n) = \left( \sum_{\hat{x}^n} q^*(\hat{x}^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}^n)} \right)^{-1}.
$$

Hence, the slope at distortion $D$ is $R'_n(D) = -\frac{\lambda}{n}$.

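Proof: The proof follows by differentiating the expression for $R_n(D)$:

$$
\begin{aligned}
\frac{dR_n}{dD} &= \frac{\partial R_n}{\partial D} + \frac{\partial R_n}{\partial \lambda} \frac{d\lambda}{dD} + \sum_{x^n} \frac{\partial R_n}{\partial \gamma(x^n)} \frac{d\gamma(x^n)}{dD} \\
&= -\frac{\lambda}{n} - \frac{1}{n} D \frac{d\lambda}{dD} + \frac{1}{n} \sum_{x^n} \frac{p(x^n)}{\gamma(x^n)} \frac{d\gamma(x^n)}{dD} \\
&= -\frac{\lambda}{n} + \frac{1}{n} \left( -D + \sum_{x^n} \frac{p(x^n)}{\gamma(x^n)} \frac{d\gamma(x^n)}{d\lambda} \right) \frac{d\lambda}{dD}.
\end{aligned}
$$

Now, consider the following expression:

$$
F = \sum_{x^n, \hat{x}^n} p(x^n)\, q^*(\hat{x}^n \| x^{n-1})\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)}.
$$

Using the $\gamma(x^n)$ given above, we have $F = 1$ and thus $\frac{dF}{d\lambda} = 0$. However,

$$
\begin{aligned}
\frac{dF}{d\lambda} &= \sum_{x^n} \frac{d\gamma(x^n)}{d\lambda}\, p(x^n) \sum_{\hat{x}^n} q^*(\hat{x}^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}^n)} - \sum_{x^n, \hat{x}^n} p(x^n)\, q^*(\hat{x}^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}^n)}\, \gamma(x^n)\, d(x^n, \hat{x}^n) \\
&= \sum_{x^n} \frac{d\gamma(x^n)}{d\lambda} \frac{p(x^n)}{\gamma(x^n)} - \sum_{x^n, \hat{x}^n} p(x^n)\, r^*(\hat{x}^n|x^n)\, d(x^n, \hat{x}^n) \\
&= \sum_{x^n} \frac{d\gamma(x^n)}{d\lambda} \frac{p(x^n)}{\gamma(x^n)} - D \\
&= 0.
\end{aligned}
$$

Hence, we can conclude that

$$
\begin{aligned}
\frac{dR_n}{dD} &= -\frac{\lambda}{n} + \frac{1}{n} \left( \sum_{x^n} \frac{p(x^n)}{\gamma(x^n)} \frac{d\gamma(x^n)}{d\lambda} - D \right) \frac{d\lambda}{dD} \\
&= -\frac{\lambda}{n}.
\end{aligned}
$$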
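The slope relation of Lemma 12 can be checked numerically in a simple case. The following is a minimal sketch, assuming block length $n = 1$, a binary source with $p = (0.3, 0.7)$, Hamming distortion, and base-2 logarithms together with the factor $2^{-\lambda d}$, so that the predicted slope is $-\lambda$. The function names and the secant approximation are illustrative choices made for this example.

```python
import numpy as np

def rate_distortion_point(p, dist, lam, iters=500):
    """Return (R, D) reached by the alternating minimization for n = 1."""
    kernel = 2.0 ** (-lam * dist)
    q = np.full(dist.shape[1], 1.0 / dist.shape[1])
    for _ in range(iters):
        r = q[None, :] * kernel
        r /= r.sum(axis=1, keepdims=True)
        q = p @ r
    r = q[None, :] * kernel                      # final r-step at the converged q
    r /= r.sum(axis=1, keepdims=True)
    D = float((p[:, None] * r * dist).sum())
    R = float((p[:, None] * r * np.log2(r / (p @ r)[None, :])).sum())
    return R, D

def secant_slope(p, dist, lam, h=1e-3):
    """Approximate dR/dD near the point of parameter lam by a secant."""
    R_hi, D_hi = rate_distortion_point(p, dist, lam + h)
    R_lo, D_lo = rate_distortion_point(p, dist, lam - h)
    return (R_hi - R_lo) / (D_hi - D_lo)

p = np.array([0.3, 0.7])
hamming = 1.0 - np.eye(2)
slope = secant_slope(p, hamming, lam=2.0)   # expected to be close to -2
```

The secant is taken between two nearby points of the curve, so the estimate agrees with $-\lambda$ only up to the step size `h`.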
Appendix D
Proof of Lemma 11

In this appendix we prove the existence of a sequence of upper and lower bounds to $R_n(D)$, the rate distortion function with feed-forward. These bounds correspond to an iteration in Alg. 1, and both converge to $R_n(D)$. To this end, we present and prove a few supplementary claims that assist in obtaining our main goal. Theorem 6, which is proved in App. C, provides an alternative form (the Lagrange dual form) of an optimization problem achieving $R_n(D)$. In Lemma 13 we show that in each iteration we can obtain measures that satisfy the constraint in Theorem 6 to form a lower bound, and that the bound is tight and achieved as the upper bound converges. We also provide a proof for the existence of an upper bound in each iteration.

Before we begin, we recall that a step in Alg. 1 is defined by the following equality:

$$
r^k(\hat{x}^n | x^n) = \frac{q^{k-1}(\hat{x}^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}^n)}}{\sum_{\hat{x}'^n} q^{k-1}(\hat{x}'^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}'^n)}}. \tag{48}
$$

We shall use this equality throughout the proof.

As mentioned, we use Theorem 6, which provides us with the following alternative optimization problem:

$$
R_n(D) = \max_{\lambda \geq 0,\ \gamma(x^n)} \frac{1}{n} \left( -\lambda D + \sum_{x^n} p(x^n) \log \gamma(x^n) \right), \tag{49}
$$

where $\gamma(x^n)$ satisfies the inequality constraint

$$
p(x^n)\, \gamma(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)} \leq p'(x^n \| \hat{x}^n) \tag{50}
$$

for some causal conditioned probability $p'(x^n \| \hat{x}^n)$.

We now show that in each iteration in Alg. 1, choosing $\gamma(x^n)$ appropriately forms a lower bound for $R_n(D)$.

Lemma 13: In the $k$th iteration in Alg. 1, by letting

$$
\gamma'_k(x^n) = \left( \sum_{\hat{x}^n} q^{k-1}(\hat{x}^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}^n)} \right)^{-1}, \tag{51}
$$

and

$$
c^k_{\hat{x}^n, x^{n-1}} = \frac{q^k(\hat{x}^n \| x^{n-1})}{q^{k-1}(\hat{x}^n \| x^{n-1})}, \tag{52}
$$

and defining

$$
\gamma_k(x^n) = \frac{\gamma'_k(x^n)}{\max_{\hat{x}^n, x^{n-1}} c^k_{\hat{x}^n, x^{n-1}}}, \tag{53}
$$

the constraint in Equation (50) is satisfied, and forms a lower bound given by

$$
R_n(D) \geq \frac{1}{n} \left( -\lambda D + \sum_{x^n} p(x^n) \log \gamma'_k(x^n) - \log \max_{\hat{x}^n, x^{n-1}} c^k_{\hat{x}^n, x^{n-1}} \right).
$$

Furthermore, this lower bound is tight, and is achieved as $R^k(D)$ converges to $R_n(D)$, where $R^k(D)$ is the upper bound.

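Proof: Let us fix the parameter $\gamma'_k(x^n)$ as in (51). Hence,

$$
\begin{aligned}
p(x^n)\, \gamma'_k(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)}
&= p(x^n) \frac{2^{-\lambda d(x^n, \hat{x}^n)}}{\sum_{\hat{x}'^n} q^{k-1}(\hat{x}'^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}'^n)}} \\
&\overset{(a)}{=} \frac{p(x^n)\, r^k(\hat{x}^n|x^n)}{q^{k-1}(\hat{x}^n \| x^{n-1})} \\
&\overset{(b)}{=} \frac{p'(x^n \| \hat{x}^n)\, q^k(\hat{x}^n \| x^{n-1})}{q^{k-1}(\hat{x}^n \| x^{n-1})} \\
&\leq p'(x^n \| \hat{x}^n) \max_{\hat{x}^n, x^{n-1}} \frac{q^k(\hat{x}^n \| x^{n-1})}{q^{k-1}(\hat{x}^n \| x^{n-1})},
\end{aligned}
$$

where (a) follows from the definition of a step in Alg. 1, given above in Equation (48), and (b) follows from the chain rule of causal conditioning, where $p'(x^n \| \hat{x}^n) = \frac{p(x^n)\, r^k(\hat{x}^n|x^n)}{q^k(\hat{x}^n \| x^{n-1})}$ is a causal conditioned PMF. Hence, combined with (53), we obtain

$$
\begin{aligned}
p(x^n)\, \gamma_k(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)} &= \frac{p(x^n)\, \gamma'_k(x^n)\, 2^{-\lambda d(x^n, \hat{x}^n)}}{\max_{\hat{x}^n, x^{n-1}} c^k_{\hat{x}^n, x^{n-1}}} \\
&\leq p'(x^n \| \hat{x}^n).
\end{aligned}
$$

Thus, we can use Theorem 6 and obtain a lower bound for $R_n(D)$, i.e.,

$$
\begin{aligned}
R_n(D) &\geq \frac{1}{n} \left( -\lambda D + \sum_{x^n} p(x^n) \log \gamma_k(x^n) \right) \\
&= \frac{1}{n} \left( -\lambda D + \sum_{x^n} p(x^n) \log \gamma'_k(x^n) - \sum_{x^n} p(x^n) \log \max_{\hat{x}^n, x^{n-1}} c^k_{\hat{x}^n, x^{n-1}} \right) \\
&= \frac{1}{n} \left( -\lambda D + \sum_{x^n} p(x^n) \log \gamma'_k(x^n) - \log \max_{\hat{x}^n, x^{n-1}} c^k_{\hat{x}^n, x^{n-1}} \right).
\end{aligned} \tag{54}
$$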



To complete the proof of this lemma, we are left to show that as $k$ increases, i.e., as the upper bound converges to $R_n(D)$, the lower bound is tight. To this end, we note that the PMFs $q^*, r^*$ that achieve the optimum value are unique, as shown in Lemma 10. Thus, it is clear that, upon convergence,

$$
\max_{\hat{x}^n, x^{n-1}} \frac{q^*(\hat{x}^n \| x^{n-1})}{q^*(\hat{x}^n \| x^{n-1})} = 1, \tag{55}
$$

and

$$
\gamma_k(x^n) = \gamma'_k(x^n) = \left( \sum_{\hat{x}^n} q^*(\hat{x}^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}^n)} \right)^{-1}. \tag{56}
$$

Placing Equations (56) and (55) in Equation (54) achieves, as shown in Theorem 6, equality instead of the chain of inequalities given. Thus $R_n(D)$ is, in fact, the solution to the optimization problem given in Equation (49), and we have demonstrated the existence of the lower bound. ■
Lemma 14: In the $k$th iteration in Alg. 1, the upper bound to the rate distortion is given by

$$
R_n(D_k) \leq \frac{1}{n} \left( -\lambda D_k + \sum_{x^n} p(x^n) \log \gamma'_k(x^n) - \sum_{x^n, \hat{x}^n} p(x^n)\, r^k(\hat{x}^n|x^n) \log c^k_{\hat{x}^n, x^{n-1}} \right),
$$

where $D_k = E_{r^k}\left[ d(X^n, \hat{X}^n) \right]$.

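Proof: Note that if $r^k(\hat{x}^n|x^n)$ produces a distortion $D_k$, then

$$
\begin{aligned}
nR_n(D_k) &\leq I_{FF}(r^k, q^k) \\
&= \sum_{x^n, \hat{x}^n} p(x^n)\, r^k(\hat{x}^n|x^n) \log \frac{r^k(\hat{x}^n|x^n)}{q^k(\hat{x}^n \| x^{n-1})} \\
&\overset{(a)}{=} \sum_{x^n, \hat{x}^n} p(x^n)\, r^k(\hat{x}^n|x^n) \log \frac{q^{k-1}(\hat{x}^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}^n)}}{q^k(\hat{x}^n \| x^{n-1}) \sum_{\hat{x}'^n} q^{k-1}(\hat{x}'^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}'^n)}} \\
&= -\lambda E_{r^k}\left[ d(X^n, \hat{X}^n) \right] - \sum_{x^n} p(x^n) \log \sum_{\hat{x}'^n} q^{k-1}(\hat{x}'^n \| x^{n-1})\, 2^{-\lambda d(x^n, \hat{x}'^n)} - \sum_{x^n, \hat{x}^n} p(x^n)\, r^k(\hat{x}^n|x^n) \log \frac{q^k(\hat{x}^n \| x^{n-1})}{q^{k-1}(\hat{x}^n \| x^{n-1})} \\
&\overset{(b)}{=} -\lambda D_k + \sum_{x^n} p(x^n) \log \gamma'_k(x^n) - \sum_{x^n, \hat{x}^n} p(x^n)\, r^k(\hat{x}^n|x^n) \log c^k_{\hat{x}^n, x^{n-1}},
\end{aligned} \tag{57}
$$

where (a) follows from the definition of a step in Alg. 1, given above in Equation (48), and (b) follows from the definitions of $\gamma'_k(x^n)$ and $c^k_{\hat{x}^n, x^{n-1}}$. Hence, we have formed an upper bound to the rate distortion as in the lemma. Note that the only inequality is in the first line of the chain, and is due to the fact that $I_{FF}(r^k, q^k) \geq \min_{r,q} I_{FF}(r, q)$. However, upon convergence, this inequality is tight. ■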
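The bounds of Lemmas 13 and 14 can be traced iteration by iteration. The following is a minimal sketch, assuming block length $n = 1$ (so that $c^k$ depends only on $\hat{x}$), a binary source with $p = (0.3, 0.7)$, Hamming distortion, and $\lambda = 2$; the function names and parameter values are illustrative assumptions. Both bounds are evaluated at the distortion $D_k$ produced by $r^k$.

```python
import numpy as np

def binary_entropy(t):
    """Binary entropy in bits, for 0 < t < 1."""
    return float(-t * np.log2(t) - (1 - t) * np.log2(1 - t))

def bounds_per_iteration(p, dist, lam, iters=100):
    """Return a list of (D_k, lower_k, upper_k) for block length n = 1."""
    kernel = 2.0 ** (-lam * dist)                        # 2^{-lam d(x, xh)}
    q_prev = np.full(dist.shape[1], 1.0 / dist.shape[1]) # q^0, strictly positive
    out = []
    for _ in range(iters):
        gamma_p = 1.0 / (kernel @ q_prev)                # gamma'_k(x), as in (51)
        r = q_prev[None, :] * kernel * gamma_p[:, None]  # r^k(xh | x), as in (48)
        q = p @ r                                        # q^k(xh)
        c = q / q_prev                                   # c^k(xh), as in (52)
        D_k = float((p[:, None] * r * dist).sum())
        base = -lam * D_k + float(p @ np.log2(gamma_p))
        lower = base - float(np.log2(c.max()))           # Lemma 13
        upper = base - float(q @ np.log2(c))             # Lemma 14
        out.append((D_k, lower, upper))
        q_prev = q
    return out

p = np.array([0.3, 0.7])
hamming = 1.0 - np.eye(2)
trace = bounds_per_iteration(p, hamming, lam=2.0)
```

In this example the lower bound never exceeds the upper bound, and the gap between them shrinks as the iterations proceed.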
We can now conclude our main objective in this appendix. 

Proof of Lemma 11: Proving this lemma requires us to present upper and lower bounds that converge to $R_n(D)$. Lemma 13 provides us with a lower bound and its tightness, whereas Lemma 14 provides us with a tight upper bound as well, as required. ■

Appendix E 

Solution to R(D) for an asymmetrical Markov source. 
The Markov source is presented in Fig. 5 above. We can describe the process $\{X_i\}$ using the equation

$$
\begin{aligned}
X_i &= X_{i-1} W_1 + (1 - X_{i-1}) W_2 \\
&= (X_{i-1}(W_1 \oplus W_2)) \oplus W_2,
\end{aligned}
$$

where $W_1 \sim B(q)$, $W_2 \sim B(p)$. This allows us to evaluate $H(X_n|X_{n-1})$:

$$
\begin{aligned}
H(X_n|X_{n-1}) &= H((X_{n-1}(W_1 \oplus W_2)) \oplus W_2 \mid X_{n-1}) \\
&= p(x_{n-1} = 1) H(W_1 \oplus W_2 \oplus W_2) + p(x_{n-1} = 0) H(W_2) \\
&= \pi_1 H(W_1) + \pi_2 H(W_2),
\end{aligned}
$$

where $\pi$ is the stationary distribution of the source. Now, to find the rate distortion function of this model, we start with the converse:

$$
\begin{aligned}
\frac{1}{n} I(\hat{X}^n \to X^n) &= \frac{1}{n} \left( H(X^n) - H(X^n \| \hat{X}^n) \right) \\
&= \frac{1}{n} H(X_1) + \frac{n-1}{n} H(X_n|X_{n-1}) - \frac{1}{n} \sum_{i=1}^{n} H(X_i | X^{i-1}, \hat{X}^i) \\
&\overset{(a)}{\geq} \frac{1}{n} H_b(\pi) + \frac{n-1}{n} H(X_n|X_{n-1}) - \frac{1}{n} \sum_{i=1}^{n} H(X_i | \hat{X}_i) \\
&\overset{(b)}{\geq} \frac{1}{n} H_b(\pi) + \frac{n-1}{n} H(X_n|X_{n-1}) - H_b(D) \\
&= \frac{1}{n} H_b(\pi) + \frac{n-1}{n} \left( \pi_1 H_b(p) + \pi_2 H_b(q) \right) - H_b(D),
\end{aligned}
$$

where (a) follows from the fact that conditioning reduces entropy, and (b) follows from the fact that $P(X_i \neq \hat{X}_i) \leq D$ and that $H_b(D)$ increases with $D$ for $D \leq \frac{1}{2}$.

However, we can achieve it by letting $\hat{X}_i$ depend on $X_i$ and $X_{i-1}$ as in Fig. 10, where $p_1, p_2$ must satisfy the following equations:

$$
\begin{aligned}
p_1 D + (1 - p_1)(1 - D) &= 1 - p, \\
p_2 D + (1 - p_2)(1 - D) &= 1 - q,
\end{aligned}
$$

i.e.,

$$
p_1 = \frac{D - p}{2D - 1}, \qquad p_2 = \frac{D - q}{2D - 1}.
$$

Fig. 10: Distribution of $\hat{X}_i$ given $X_{i-1}$ and $X_i$.
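The closed-form expressions for $p_1$ and $p_2$ can be checked by direct substitution. The following is a minimal sketch; the function name `crossover_parameters` and the sample values $p = 0.3$, $q = 0.4$, $D = 0.1$ are hypothetical choices made for illustration only.

```python
def crossover_parameters(p, q, D):
    """Solve p_i * D + (1 - p_i) * (1 - D) = 1 - p (resp. 1 - q) for p_1, p_2.

    Requires D != 1/2, since the common denominator is 2D - 1.
    """
    denom = 2.0 * D - 1.0
    return (D - p) / denom, (D - q) / denom

def left_hand_side(p_i, D):
    """Left-hand side of the defining equation for a given p_i."""
    return p_i * D + (1.0 - p_i) * (1.0 - D)

# Illustrative values only.
p1, p2 = crossover_parameters(p=0.3, q=0.4, D=0.1)
```

Substituting `p1` and `p2` back through `left_hand_side` recovers $1 - p$ and $1 - q$ respectively.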



Note that under this construction, the source $X^n$ is still Markovian. Further, from Fig. 10 we can see that $X_{i-1} - \hat{X}_i - X_i$ forms a Markov chain, and $H(X_i|\hat{X}_i) = H_b(D)$. Thus, we obtain equality in (a), (b) in the above chain of inequalities, and hence have shown that

$$
R_n(D) = \frac{1}{n} H_b(\pi) + \frac{n-1}{n} \left( \pi_1 H_b(p) + \pi_2 H_b(q) \right) - H_b(D).
$$

By taking $n$ to infinity we obtain

$$
R(D) = \pi_1 H_b(p) + \pi_2 H_b(q) - H_b(D).
$$